Abstract


This thesis outlines the need for a simple
language-independent model for string patterns, and develops
in detail an algebraic model meeting these criteria.
Following  Gimpel  [1973],  we  define  patterns  as  functions  of 
contexts,  each  consisting  of  a  subject  string  indexed  by  an 
integer  pre-cursor  position.  We  re-define  the  value  of  a 
pattern  match  to  be  a  counted  set  of  post-cursor  positions, 
and  define  alternation  and  concatenation  as  pattern-valued 
operations on patterns. A ring structure then results from
the  introduction  of  the  concept  of  negative  patterns.  The 
thesis  explores  a  theory  and  interpretation  of  semi-inverse 
patterns  under  concatenation,  and  develops  a  mathematical 
semantics  for  recursive  pattern  definitions  using  least 
fixed point techniques.

We  demonstrate  the  utility  of  the  model  by 
incorporating  it  in  the  design  of  some  relevant  features  of 
a  string  manipulation  language,  and  outline  an  approach  to 
the  design  of  future  string  manipulation  languages.  The 
thesis discusses some implementation techniques and
efficiency considerations, and suggests directions for
future research.

1. Introduction


A  great  deal  has  been  learned  about  programming  and 
programming  language  design  during  the  past  ten  years.  The 
concepts  of  structure,  protection  and  abstraction  are  having 
profound effects on the way languages are designed.
Unfortunately,  however,  the  progression  from  FORTRAN  to 
PASCAL  and  SIMULA  and  beyond  has  no  parallel  in  the 
development  of  string  manipulation  languages. 

It  was  the  paucity  of  powerful  string  manipulation 
tools that led to the invention of SNOBOL
[Farber  et  al,  1964].  Originally  designed  as  a  tool  for  the 
development  of  a  specific  set  of  symbol  manipulation 
routines,  SNOBOL  was  received  with  widespread  interest  by 
people  with  a  great  variety  of  string  manipulation  problems. 
The  development  of  subsequent  versions  of  the  language, 
culminating in SNOBOL4 [Griswold et al, 1971], proceeded in
an ad hoc fashion, with features being added and deleted as
experience with the language accumulated. The introduction
of third-generation hardware provided an opportunity, not
overlooked by SNOBOL's designers, to stem a rising tide of
dialects by designing and implementing SNOBOL4
[Griswold 1972]. Applications were found in such areas as
linguistics, symbolic mathematics, theorem proving, graph
theory, text preparation, mechanical language translation,
and music analysis.

Although SNOBOL4 is dominant in the field, other
string  manipulation  languages  have  been  developed,  among 
them  AXLE  [Cohen  and  Wegstein  1965],  PANON  [di  Forino  1968], 
and  COMIT  [Yngve  1972].  Each  of  these  languages  provides 
facilities  for  the  grouping,  rearranging,  inserting, 
deleting,  sorting,  testing,  tagging,  and  counting  of 
strings. Accordingly, they share a common set of design
difficulties,  characteristic  of  this  class  of  languages. 

The  central  problem  in  designing  string 
manipulation  languages  is  to  find  a  set  of  basic  concepts 
for  building  string  composition  and  transformation 
procedures,  which  appear  quite  different  and  more  general 
than  ordinary  algebraic  computations. 

"Unfortunately,  no  standard  notation  or 
accepted  system  of  operations  exists  for 
string  manipulations." 

Farber et al. [1964].

The difficulty arises largely from the weak algebraic
structure associated with the monoid, A*, of strings
generated from an alphabet, A, under the fundamental
operation of concatenation. In particular, many useful
mappings from tuples of strings in A* to elements of A* are
not expressible in terms of concatenation alone.

The  second  fundamental  operation  of  string 
manipulation  is  substring  extraction.  This  can  be  done 
deterministically,  by  specifying  the  first  character 
position and length of the substring to be extracted, or
non-deterministically, by a process of pattern matching. A
number of formal systems exist for the specification of
pattern matching processes, including Post systems
(described in [Minsky 1967]), and Markov normal algorithms.

"It  was  Markov's  thesis  that  any  algorithmic 
activity  involving  the  manipulation  of 
characters ... could be specified as a
normal  algorithm." 

Galler and Perlis [1970]

A Markov normal algorithm consists of a sequence of
rules for the replacement of characters from a string,
stored in a register, by other characters. A number of
variants  and  extensions  to  normal  algorithms  exist 
[di  Forino  1968],  but  in  general,  a  rule  consists  of  a  left- 
hand  side,  specifying  the  substring  of  the  register  contents 
that  is  to  be  replaced,  and  a  right-hand  side,  specifying 
the  replacement  string.  The  intent  of  a  rule  is  that  the 
left-hand side, or pattern, be matched in some fashion
against  the  register  contents.  If  a  substring  as  specified 
by  the  pattern  is  found,  the  rule  is  said  to  be  applicable, 
or,  equivalently,  the  pattern  match  is  said  to  have 
succeeded.  Otherwise,  the  pattern  match  has  failed,  and  the 
rule  is  inapplicable.  If  the  rule  is  applicable,  the  string 
specified  by  the  right-hand  side  replaces  the  substring 
matched  by  the  pattern. 

Unfortunately,  Markov  algorithms  are  not  in 
themselves  a  very  practical  tool  for  string  manipulation 
applications. The widespread difficulty in learning and
using Markov algorithm based languages is summarized in a
comment  by  C.  Strachey  at  the  IFIP  Working  Conference  on 
Symbol  Manipulation  Languages,  1968: 

"I  think  one  of  the  troubles  about  Markov 
algorithm  type  languages,  at  least  for  my 
purposes,  is  that  there  is  a  rather  small 
number  of  problems  in  which  my  mind  works 
like  a  Markov  algorithm." 

We  feel  that  this  problem  can,  to  a  great  degree, 
be  alleviated  by  embedding  the  pattern  matching  and 
replacement features of Markov algorithms into languages
that  encourage,  and  as  far  as  possible  enforce,  the  concepts 
of  structured  programming.  Perhaps  the  most  important  point 
in  this  thesis  and  in  [Stewart  1973],  is  that  string 
manipulation  languages  are  not  a  special  case,  and  the 
ongoing  research  into  the  principles  of  programming  language 
design  should  not  be  ignored  by  the  designers  of  string 
manipulation  languages.  We  feel  that  much  of  the  current 
research  aimed  at  adding,  deleting,  and  redefining  features 
of SNOBOL4 [Abrahams 1974; Doyle 1973; Druseikis and
Griswold 1973; Griswold 1973] is misguided. Despite its
success, no one believes that FORTRAN will evolve into
UTOPIA/84 [Knuth 1974] by a process of modification, and it
is unreasonable to expect that SNOBOL4 will be able to do
the same. New string manipulation languages should be
designed, borrowing from the strengths of SNOBOL and related
languages, but unfettered by existing implementations and
compatibility constraints.

One of the most pressing problems in the design of
better  string  manipulation  languages  arises  in  the 
specification of patterns. SNOBOL4 patterns, although
extremely flexible and powerful, are notoriously difficult
to explain and use. We feel that this is due primarily to
the complexity of the interpretive model used for pattern
matching in SNOBOL4. In particular, the programmer must be
intimately familiar with the operation of the pattern
matching procedure, called the "scanner", including its
backtracking and pattern matching heuristics. Coyer [1973]
has developed a theoretical model to distinguish problems
inherent in SNOBOL4 pattern matching from those inherent in
its implementation. However, Coyer's model is intended to
aid in the design of improved SNOBOL4 implementations, not
to clarify and simplify the semantics of patterns and
pattern matching. It is to the latter problem that the
algebraic model developed in this thesis is addressed.

In  chapter  2  a  number  of  definitions  and  notational 
conventions  are  introduced,  and  an  example  illustrating  the 
lattice-theoretic approach to the theory of mathematical
semantics  is  presented.  This  material  is  preparatory  to  the 
algebraic  model  for  patterns  and  pattern  matching  developed 
in  chapter  3.  By  appealing  to  familiar  algebraic  notions, 
most  notably  that  of  a  ring,  and  to  a  theory  of  mathematical 
semantics,  it  is  felt  that  the  model  developed  here  is  more 
logically  consistent,  and  hence  simpler  to  understand  and 
use correctly than the SNOBOL4 interpretive model. For
example, the notions of "negative pattern" and "semi-inverse
pattern" allow certain classes of strings to be recognized
by patterns that are both easier to read and simpler to
write than their SNOBOL4 counterparts. As another example,
the assumption-based recursive pattern definition mechanism
in SNOBOL4 is notoriously error prone, whereas the semantics
developed here for recursive patterns is consistent and
assumption-free.

An  important  feature  of  the  algebraic  model  is  that 
it  is  independent  of  a  host  programming  language.  Chapter 
4,  then,  is  an  exercise  in  designing  relevant  aspects  of  a 
programming  language  incorporating  the  model.  This  serves 
to put the model into a context where it can more reasonably
be  evaluated,  and  to  suggest  an  approach  to  designing 
successor  languages  to  SNOBOL4.  The  design  in  chapter  4 
also  serves  to  raise  a  number  of  implementation  issues,  some 
of  which  are  discussed  in  chapter  5.  In  comparing  the 
efficiency of the algebraic model with that of the SNOBOL4
interpretive model, it is claimed that, subject to suitable
optimizations, implementations of the two models should be
comparable. Chapter 6 summarizes the results of the
previous chapters and suggests directions for future
research.

2. Definitions and Notation


In  this  section  we  establish  a  number  of 
definitions  and  notational  conventions,  preparatory  to  the 
development  of  the  algebraic  model  for  string  patterns. 


2.1 Strings

We  start  by  assuming  the  existence  of  some 
alphabet, A, of atomic symbols or characters. The nature of
these  symbols  is  left  undefined;  the  emphasis  in  the 
examples  of  this  section  will  be  on  a  character  set  of 
letters,  digits,  and  punctuation  symbols,  but  there  is 
nothing  in  the  development  preventing  the  symbols  of  the 
alphabet  from  representing  other  entities,  even  entities 
having  structure,  as  long  as  that  structure  is  not  used  in 
string  manipulation  operations. 


We  give  the  following  inductive  definition  of 
strings  over  A: 

Definition: The null string is a string over every
alphabet. If S is a string over A, and a ∈ A, then Sa is a
string over A. The length of a string, S, is denoted by
|S|, and is the total number of characters in S. The length
of the null string is zero, i.e., it consists of no
characters.

2.2 Counted Sets

The definition that we shall give for patterns
depends  upon  an  extended  form  of  sets.  The  definition, 
below,  of  counted  sets  provides  us  with  a  mechanism  for 
retaining  with  each  set  element  any  information  that  can  be 
encoded  as  a  single  integer: 

Definition: A counted set (also called a multi-set
[Earley 1974], or bag [Reboh and Sacerdoti 1973]) is a set
with the added property that with each member of the set
there is associated an integer denoting the multiplicity of
that element. If S is a counted set containing the
elements s1, s2, ..., sm with multiplicities n1, n2, ..., nm
respectively, then S is written as {n1*s1, n2*s2, ...,
nm*sm}. If the multiplicity of an element is not given
explicitly, that element is assumed to have multiplicity
one.

Note  that  the  definition  of  counted  sets  does  not 
exclude  the  possibility  of  zero  or  even  negative 
multiplicities.  We  postpone  for  the  moment  a  discussion  of 
negative  multiplicities;  however,  we  remark  that  if  s  is 
not  an  element  of  a  counted  set,  S,  then  s  can  be  said  to 
have multiplicity zero in S. Ordinary sets, then, are
simply counted sets in which all elements have multiplicity
one or zero.

We first establish a partial ordering for counted
sets:

Definition: Let S and T be counted sets. Then S contains
T, written S ≥ T, if every element of T with non-zero
multiplicity has the same or greater multiplicity in S. S
and T are equal, written S = T, if and only if S ≥ T and
T ≥ S.

In  developing  an  algebraic  model  for  string 
patterns,  we  will  require  definitions  for  the  following 
operations  on  counted  sets: 

Definition: Let C = {n1*c1, n2*c2, ..., nm*cm} be a counted
set, and let k be an integer. Then k*C is a counted set,
and is defined by

k*C = {k(n1)*c1, k(n2)*c2, ..., k(nm)*cm}

We remark that 0*C = ∅ (the empty set).

Definition: Let C = {n1*c1, n2*c2, ..., nm*cm} be a counted
set of integers, and let k be an integer. Then

k ± C = {n1*(k±c1), n2*(k±c2), ..., nm*(k±cm)}

(Contrast this definition with that of k*C. * is an
operation on element multiplicities, and hence it is defined
for all counted sets; + and - are operations on the
elements of counted sets of integers.)

The  concept  of  set  union  can  be  extended  to  counted 
sets in any of a number of ways. We define here the notion
of additive union, which we will use in defining the
operations of pattern alternation and concatenation:

Definition: Let C and D be counted sets. The additive union
of C and D, denoted by C + D, is a counted set containing
the set-theoretic union of the elements of C and D, with the
property that the multiplicity of each element in C + D is
the sum of its multiplicities in C and D.

Example: Let C = {5*a, -6*b, c}
             D = {-2*a, 6*b}
             E = {a, -7*b, c}
then
             C + D = {3*a, c}
             C + E = {6*a, -13*b, 2*c}
             D + E = {-1*a, -1*b, c}

Lemma: Let C, D, and E be counted sets. Then

(i) C + D = D + C

(ii)  (C  +  D)  +  E  =  C  +  (D  +  E) 

The  proofs  of  commutativity  and  associativity  of  additive 
union  follow  trivially  from  the  corresponding  properties  of 
addition  and  set  union. 
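
The counted-set operations defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the thesis: the representation (a dictionary from element to non-zero integer multiplicity) and the helper names `counted`, `scale`, `shift`, and `union` are assumptions introduced here. The example reproduces the sets C, D, and E used above.

```python
def counted(pairs):
    """Build a counted set from (multiplicity, element) pairs.

    Assumed representation: dict mapping element -> non-zero multiplicity.
    Elements whose multiplicities sum to zero are dropped.
    """
    out = {}
    for n, x in pairs:
        out[x] = out.get(x, 0) + n
    return {x: n for x, n in out.items() if n != 0}


def scale(k, c):
    """k*C: multiply every multiplicity by the integer k."""
    return counted((k * n, x) for x, n in c.items())


def shift(k, c, sign=+1):
    """k + C (sign=+1) or k - C (sign=-1) for a counted set of integers.

    Acts on the elements and leaves the multiplicities unchanged.
    """
    return counted((n, k + sign * x) for x, n in c.items())


def union(c, d):
    """Additive union C + D: multiplicities of common elements are summed."""
    return counted([(n, x) for x, n in c.items()] +
                   [(n, x) for x, n in d.items()])


# The sets from the example in the text.
C = counted([(5, "a"), (-6, "b"), (1, "c")])
D = counted([(-2, "a"), (6, "b")])
E = counted([(1, "a"), (-7, "b"), (1, "c")])

print(union(C, D))
print(union(C, E))
print(union(D, E))
```

Python's built-in `collections.Counter` is deliberately not used for the union, because its `+` operator discards non-positive counts, whereas negative multiplicities are essential to the model.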

2.3 Lattice-Theoretic Extensions

In  developing  a  mathematical  semantics  for 
recursive  pattern  definitions  we  will  be  using  the  lattice- 
theoretic  approach  to  the  theory  of  computation,  based  on 
the  notion  of  finite  approximation  [Donahue  1974; 
Scott  1970].  To  use  this  theory,  the  data  types  with  which 
we will be dealing, namely strings and counted sets of
integers, must be defined as complete lattices. A review of
some definitions [Birkhoff 1967] may prove helpful at this
point:

Definition: A partially ordered set (poset) is a set in
which a binary relation x ≤ y is defined, which satisfies
for all x, y, z the following conditions:

1. For all x, x ≤ x. (reflexive)

2. If x ≤ y and y ≤ x, then x = y. (anti-symmetric)

3. If x ≤ y and y ≤ z, then x ≤ z. (transitive)

Definition: Let A be a poset partially ordered by ≤, and
let B be a non-empty subset of A. Then an element a ∈ A is
an upper (lower) bound of B if and only if for all
b ∈ B, b ≤ a (a ≤ b). The least (greatest) element of the
set of all upper (lower) bounds of B is the least upper
bound (greatest lower bound) of B, symbolized by lub B
(glb B).

Definition: A lattice is a poset any two of whose elements
have a greatest lower bound and a least upper bound. A
lattice, L, is complete when each of its subsets has a lub
and a glb in L.

In  dealing  with  data  types,  we  make  the  following 
definition: 

Definition: A domain is a lattice-structured data type that
is partially ordered by approximation.

We have already defined a partial ordering by
containment for counted sets. To form a complete lattice
from the counted sets of valid cursor positions for a given
string, S, (i.e., those integers c such that 0 ≤ c ≤ |S|) we
add to the data type an overdetermined (universal) counted
set, denoted top, which is the least upper bound of the
domain and contains every counted set, and an
underdetermined counted set, denoted bot, which is the
greatest lower bound of the domain and is contained in every
counted set. We denote the domain thus defined by D(S).

To  strings,  we  simply  add  two  new  values,  top  and 
bot,  and  define  a  partial  ordering,  ap,  to  form  a  primitive 
domain,  S,  as  follows: 

S =
                      top
                       |
      +--------+-------+--------+--------+
      |        |       |        |
      S0       S1      S2  ...  Sn  ...
      |        |       |        |
      +--------+-------+--------+--------+
                       |
                      bot


where S0, S1, S2, ..., Sn, ... is an enumeration of all
strings over a given alphabet. Adopting the Scott
terminology, bot is the greatest lower bound of the data
type, and an approximation of every string. top is the
least upper bound of the data type, and is an extension of
every string.

The product domain S x D(S) is defined for valid
contexts (S,c) where S ∈ S and c ∈ D(S). The partial
ordering on this lattice is defined by (S1,c1) ap' (S2,c2)
if and only if S1 ap S2 in S and c1 ≤ c2 in D(S).

2.4 Functions and Least Fixed Points

The  second  requirement  of  the  Scott  theory  is  that 
mappings  between  data  types  be  continuous  in  the  sense  of 
the  following  definitions: 

Definition: A subset X of a lattice is directed if and only
if X contains an upper bound for each finite subset of X.

Definition: A function f: D -> D' is continuous if and only
if for every directed subset X of D,

f(lub X) = lub {f(x) | x ∈ X},

i.e., the function preserves limits.

Among  the  operations  we  have  defined  thus  far  on 
counted  sets,  only  multiplication  by  a  negative  integer 
constant  is  not  continuous,  and  will,  accordingly,  be  shown 
to  be  ineligible  for  use  in  defining  recursive  patterns. 

The  importance  of  complete  lattices  and  continuous 
functions  lies  in  the  technique  of  modelling  recursive 
functions  by  infinite  sequences  of  finite  approximations, 
and then showing that any particular value of these
functions is determined by a finite number of
approximations. The technique is well illustrated by an
example, from [Donahue 1974], modelling the recursive
factorial function. We adopt the notation

f = (D: x) D': body

to define a function f: D -> D'.

Example: Consider the primitive domain of integers, Num,
and the multiplication operator * extended from the set of
integers to the domain Num as follows:


   n1  |  n2  | n1*n2
  -----+------+-------
   i1  |  i2  | i1×i2
   bot |  i   | bot
   top |  i   | top
   i   |  bot | bot
   bot |  bot | bot
   top |  bot | top
   i   |  top | top
   bot |  top | top
   top |  top | top

This  extension  makes  *  continuous,  since  it  is  continuous  in 
each  argument.  Now  we  can  define  the  factorial  function 

Fact:  Num  ->  Num 

on  this  domain  by 

Fact  =  (Num:  k)  Num:  if  k=0  then  1  else  k*Fact(k-1). 
Notice  that  Fact  can  be  regarded  as  a  parameter  of  the 
right-hand  side  of  the  equation.  Thus,  the  equation 
defining  Fact  can  be  rewritten  as 


Fact = F(Fact)

where F: [Num -> Num] -> [Num -> Num] is defined by

F = (Num -> Num: G) Num -> Num:
        (Num: k) Num:
            if k=0 then 1 else k*G(k-1)

Fact is then a fixed point of F by the following definition:

Definition: Consider a function f: D -> D for some domain
D. Then x ∈ D is a fixed point of f if and only if
x = f(x). A fixed point x of f is a least fixed point of f
if and only if for all fixed points x1, x2, ... of f,

x ap x1, x ap x2, ... .

It  can  be  shown  that  any  continuous  function  on  a 
complete  lattice  has  a  least  fixed  point  [Scott  1972]. 
Moreover, if f is continuous, the least fixed point x = f(x)
is given by the iteration formula

x = lub {f^0(bot), f^1(bot), f^2(bot), ...}

where

f^n(bot) = f(f(...f(bot)...))

for n applications of f.

Thus,  we  can  solve  the  equation 

Fact  =  F(Fact) 

for  its  least  fixed  point  using 

Fact = lub {F^0(bot), F^1(bot), F^2(bot), ...}

where bot ∈ Num -> Num maps each element of Num to bot. We
produce the sequence of domain elements

F^0(bot), F^1(bot), F^2(bot), ...

where

F^0(bot) ap F^1(bot) ap F^2(bot) ap ...

and form the limit

lub {F^0(bot), F^1(bot), F^2(bot), ...}

which is the least fixed point of F. A table of some of the
F^n for various values of n and k will show that the least
fixed point is indeed the factorial function:


   k   | F^0(bot) | F^1(bot) | F^2(bot) | F^3(bot) | F^4(bot) | F^5(bot) | Fact
  -----+----------+----------+----------+----------+----------+----------+------
   bot |   bot    |   bot    |   bot    |   bot    |   bot    |   bot    | bot
   0   |   bot    |    1     |    1     |    1     |    1     |    1     |  1
   1   |   bot    |   bot    |    1     |    1     |    1     |    1     |  1
   2   |   bot    |   bot    |   bot    |    2     |    2     |    2     |  2
   3   |   bot    |   bot    |   bot    |   bot    |    6     |    6     |  6
   4   |   bot    |   bot    |   bot    |   bot    |   bot    |    24    |  24
   .   |          |          |          |          |          |          |
   .   |          |          |          |          |          |          |
   top |   bot    |   top    |   top    |   top    |   top    |   top    | top
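
The iteration that produces this table can be sketched in Python. The sketch below is an illustration under stated assumptions rather than the formulation of [Donahue 1974]: bot is modelled by Python's `None`, the value top is omitted for brevity, and the names `BOT`, `bottom`, `F`, and `iterate` are introduced here. `iterate(n)` plays the role of the n-th approximation, i.e. F applied n times to the everywhere-undefined function.

```python
BOT = None  # assumed stand-in for bot, the undefined value


def bottom(k):
    """The least element of [Num -> Num]: maps every argument to bot."""
    return BOT


def F(G):
    """The functional F: given an approximation G, return a better one."""
    def approx(k):
        if k is BOT:           # an undefined argument gives an undefined result
            return BOT
        if k == 0:
            return 1
        prev = G(k - 1)
        # multiplication is extended so that k * bot = bot
        return BOT if prev is BOT else k * prev
    return approx


def iterate(n):
    """Return the n-th approximation: F applied n times to bottom."""
    g = bottom
    for _ in range(n):
        g = F(g)
    return g


# One row per k, one column per approximation index n = 0..5.
for k in range(5):
    print(k, [iterate(n)(k) for n in range(6)])
```

Each approximation is defined on exactly one more argument than its predecessor, which is the sense in which any particular value of the factorial function is determined by a finite number of approximations.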

2.5 Grammars

The meta-language we will use in presenting
grammars and grammar fragments is adapted from
[Wortman 1973]. Productions are of the form

Left_side '=' Right_side ';'

where "Left_side" is a non-terminal symbol, and "Right_side"
is a sequence of terminal, non-terminal, and meta symbols.
Terminal symbols are enclosed in apostrophes. Square
brackets delimit options, and the meta-symbols '*' and
'+' mean "zero or more repetitions" and "one or more
repetitions", respectively. Parentheses are used to factor
expressions, and alternate expressions are separated by
vertical bars.

Productions defining lists will be uniformly
omitted. List elements are assumed to be separated by
commas, as, for example, in the production

Identifier_list = Identifier (',' Identifier)* ;

In  some  cases,  details  of  operator  precedence  and 
statement  punctuation  will  be  omitted  to  simplify  the 
grammar, resulting in a number of ambiguities.

3. An Algebraic Model for String Patterns

3.1 Introduction

Two facts lead to the search for an algebraic model
for string patterns for the specification of string
transformation procedures: first, there is only a very weak
algebraic structure inherent in strings, and, second, it is
the function of patterns to impose a structure upon the
strings they match. Early attempts to directly enrich the
algebraic structure of strings failed to produce useful
results, encouraging attempts to approach the problem
indirectly, through string patterns. A strong hint of the
viability of this approach is contained in [Gimpel 1973],
and this section is a reformulation and extension of
Gimpel's theory of discrete patterns in order to enrich the
algebraic properties of the model. The principal
justification for the model developed here lies in the
natural interpretations that can be given to the concepts
that will be introduced to enrich the algebra. The most
notable of these are negative and semi-inverse patterns,
both of which resemble similar but more awkward constructs
in the SNOBOL4 model.

3.2 Patterns

Equipped  with  a  definition  of  counted  sets,  we  can 
now  give  a  definition,  adapted  from  [Gimpel  1973],  of 
patterns:

Definition: A pattern is a function P(S,c), where S is
a string, called the subject string, and c is an integer
indexing S, called the pre-cursor position. The value of
P(S,c) is a counted set of integers, called the post-cursor
positions.

Two patterns are said to be equal if and only if
they represent the same function.

Cursor positions can take values in the range 0, 1,
..., |S|. Thus, the cursor may be regarded as a pointer
into the subject string, motivating the following
definition:

Definition: The ordered pair (S,c), where S is a subject
string and 0 ≤ c ≤ |S|, is referred to as a context for
pattern matching, and represents a sectioned string
[Griswold 1973], sectioned into two parts by a cut between
the c-th and (c+1)st characters of S.

Examples : 

1. Let A be any string. We can define a pattern,
   which we denote simply by A, as follows: For all
   contexts (S,c), if Substring(S,c,|A|) = A, then
   A(S,c) = {c+|A|}, else A(S,c) = ∅.

2. Let P = 'AB' and S = 'ABAB'. Then P(S,c) for c = 0,
   2, 3, and 4 is {2}, {4}, ∅, and ∅
   respectively.

3. The SNOBOL4 primitive pattern-valued function LEN
   is defined as follows: Let n be a non-negative
   integer. For all contexts (S,c), if c+n ≤ |S|,
   then LEN(n)(S,c) = {c+n}, else LEN(n)(S,c) = ∅.

We  can  now  introduce  the  concept  of  negative 
patterns,  which  is  central  to  the  extended  algebraic 
structure  of  the  model. 

Definition: Let P be a pattern. Then -P is a pattern, and
is defined by

(-P)(S,c) = (-1)*P(S,c)

where (S,c) is an arbitrary context.

Example: Let P = LEN(1) and S = 'AA'. Then (-P)(S,c) for
c = 0, 1, and 2 is {-1*1}, {-1*2}, and ∅
respectively.

Thus,  a  negative  pattern  is  defined  in  a  purely 
algebraic  sense  as  a  function  producing  a  negative  instance 
of  a  counted  set. 




The  final  definition  in  this  section  extends  the 
definition  of  an  arbitrary  pattern,  P,  to  allow  a  counted 
set  of  valid  cursor  positions  as  its  second  argument: 

P(S,{n1*c1, n2*c2, ..., nm*cm})
    = (n1*P(S,c1)) + (n2*P(S,c2)) + ... + (nm*P(S,cm))

If the second argument is the empty set, we define

P(S,∅) = ∅

Example: Let P = 'AB' and S = 'ABAB'. Then

P(S,{2*0, 1, 2}) = (2*{2}) + ∅ + {4}
                 = {2*2, 4}
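
The definitions of this section can be made concrete with a small Python sketch. It is offered only as an illustration under assumptions that are not in the thesis: a pattern is modelled as a Python function of (S, c) returning a dictionary from post-cursor position to multiplicity, and the helper names `counted`, `lit`, `length`, `neg`, and `extend` are introduced here. The calls at the end reproduce the three examples given above.

```python
def counted(pairs):
    """Counted set as a dict element -> non-zero multiplicity (assumed encoding)."""
    out = {}
    for n, x in pairs:
        out[x] = out.get(x, 0) + n
    return {x: n for x, n in out.items() if n != 0}


def lit(a):
    """The pattern denoted by a string A: matches A at the cursor."""
    def p(s, c):
        return {c + len(a): 1} if s.startswith(a, c) else {}
    return p


def length(n):
    """LEN(n): matches any n characters, if that many remain."""
    def p(s, c):
        return {c + n: 1} if c + n <= len(s) else {}
    return p


def neg(p):
    """-P: the same post-cursor positions with negated multiplicities."""
    return lambda s, c: {post: -m for post, m in p(s, c).items()}


def extend(p, s, cursors):
    """P applied to a counted set of cursors: the sum of n_i * P(S, c_i)."""
    return counted((n * m, post)
                   for c, n in cursors.items()
                   for post, m in p(s, c).items())


AB = lit("AB")
print([AB("ABAB", c) for c in (0, 2, 3, 4)])
print([neg(length(1))("AA", c) for c in (0, 1, 2)])
print(extend(AB, "ABAB", {0: 2, 1: 1, 2: 1}))
```

An empty dictionary stands for the empty counted set, so a failed match and P(S,∅) are both represented by `{}`.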


3.3 The Algebra of Patterns

We  are  now  able  to  define  the  pattern  synthesis 
operations  of  concatenation  and  alternation.  In  the 
definitions  that  follow,  P  and  Q  denote  patterns,  and  (S,c) 
is  an  arbitrary  context. 

Definition: The concatenation of P and Q, denoted by P & Q,
is a function defined by

(P & Q)(S,c) = Q(S,P(S,c))

Definition: The alternation of P and Q, denoted by P | Q, is
a function defined by

(P | Q)(S,c) = P(S,c) + Q(S,c)

We  now  propose  to  show  that,  given  suitable 
definitions  of  identity  patterns  for  the  operations  of 


alternation  and  concatenation,  patterns  form  a  ring  with 
unit  element. 

Definition:  FAIL  is  a  pattern  defined  by 

FAIL(S,c) = ∅

for  all  contexts  (S,c). 

Intuitively,  FAIL  is  a  pattern  that  fails  to  match 
any  string. 

Definition:  NULL  is  a  pattern  defined  by 

NULL(S,c) = {c}

for all contexts (S,c).

Intuitively,  NULL  is  a  pattern  that  matches  just 
the  null  string. 

Theorem:  Patterns,  under  the  operations  of  alternation  and 

concatenation,  form  a  ring  with  unit  element. 

Proof: Let P, Q, and R be arbitrary patterns, and (S,c) be
an arbitrary context. It is a straightforward process to
verify that

1. P | Q is a pattern.

2. P | Q = Q | P

3. (P | Q) | R = P | (Q | R)

4. P | FAIL = P

5. P | -P = FAIL

By 1. - 5., patterns under | form a group.

6. P & Q is a pattern.

7. (P & Q) & R = P & (Q & R)

8. P & (Q | R) = (P & Q) | (P & R)

   (Q | R) & P = (Q & P) | (R & P)

By 1. - 8., patterns under | and & form a ring.

9. P & NULL = NULL & P = P

Q.E.D.
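
The identities used in the proof can be spot-checked numerically. The Python sketch below is not a proof and is not taken from the thesis; it assumes the same encoding as before (a pattern is a function of (S, c) returning a dictionary from post-cursor position to multiplicity), and the names `alt`, `cat`, `neg`, `extend`, and `same` are hypothetical helpers. Equality of patterns is tested only on a handful of subject strings.

```python
def counted(pairs):
    """Counted set as a dict element -> non-zero multiplicity (assumed encoding)."""
    out = {}
    for n, x in pairs:
        out[x] = out.get(x, 0) + n
    return {x: n for x, n in out.items() if n != 0}


def extend(p, s, cursors):
    """P applied to a counted set of cursors: the sum of n_i * P(S, c_i)."""
    return counted((n * m, post) for c, n in cursors.items()
                   for post, m in p(s, c).items())


def lit(a):
    return lambda s, c: {c + len(a): 1} if s.startswith(a, c) else {}


def length(n):
    return lambda s, c: {c + n: 1} if c + n <= len(s) else {}


def neg(p):
    return lambda s, c: {post: -m for post, m in p(s, c).items()}


def alt(p, q):
    """P | Q: additive union of the two results."""
    return lambda s, c: counted([(m, x) for x, m in p(s, c).items()] +
                                [(m, x) for x, m in q(s, c).items()])


def cat(p, q):
    """P & Q: apply Q at every post-cursor position produced by P."""
    return lambda s, c: extend(q, s, p(s, c))


FAIL = lambda s, c: {}
NULL = lambda s, c: {c: 1}


def same(p, q, subjects=("", "A", "AAB", "ABAB")):
    """Compare two patterns on every context of a few sample subjects."""
    return all(p(s, c) == q(s, c) for s in subjects for c in range(len(s) + 1))


P, Q, R = lit("A"), length(1), lit("AB")
print(same(alt(P, Q), alt(Q, P)))                           # item 2
print(same(alt(P, neg(P)), FAIL))                           # item 5
print(same(cat(cat(P, Q), R), cat(P, cat(Q, R))))           # item 7
print(same(cat(P, alt(Q, R)), alt(cat(P, Q), cat(P, R))))   # item 8
print(same(cat(alt(Q, R), P), alt(cat(Q, P), cat(R, P))))   # item 8
print(same(cat(P, NULL), P) and same(cat(NULL, P), P))      # item 9
```

Both distributive laws hold here because the extension of a pattern to counted sets of cursors is linear in its second argument, which is the step the proof relies on.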

We  have  deliberately  avoided  attaching  an  informal 
semantic  interpretation  to  negative  patterns  since  their 
role  in  the  pattern  matching  model  can  be  defined  solely  in 
terms  of  inverse  elements  in  the  pattern  algebra.  However, 
the  theory  can  best  be  put  into  perspective  by  suggesting  an 
application  of  negative  patterns.  Specifically,  pattern 
negation  is  useful  in  specifying  patterns  that  are  set 
difference  expressions.  That  is,  if  we  view  patterns  as 
functional  specifications  of  sets  of  strings,  then  the 
pattern P | -Q specifies the set of strings matched by P
but not by Q, i.e., the set difference of P and Q. Pattern
negation, then, is a generalization of the function of the
primitive pattern ABORT in SNOBOL4 [Griswold et al. 1971].

Example: The following SNOBOL4 pattern matches any
character except an asterisk:

'*' ABORT | LEN(1)

Using instead the concept of negative patterns, we
have:

-'*' | LEN(1)

Note that we could equally well write:

LEN(1) | -'*'

whereas the SNOBOL4 pattern

LEN(1) | '*' ABORT

would not have the desired effect.
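To see how the negative pattern cancels the unwanted match, the example can be evaluated directly. This is a small sketch under the assumption that counted sets are dicts from cursor position to integer multiplicity; the helper names are illustrative and do not come from the thesis.

```python
def add(x, y):
    """Counted-set sum; entries whose multiplicity becomes zero are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def scale(i, x):
    """Multiply every multiplicity by the integer i."""
    return {pos: i * m for pos, m in x.items() if i * m != 0}

def lit(t):      # primitive string pattern
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

def length(n):   # analogue of LEN(n)
    return lambda s, c: {c + n: 1} if c + n <= len(s) else {}

def alt(p, q): return lambda s, c: add(p(s, c), q(s, c))
def neg(p):    return lambda s, c: scale(-1, p(s, c))

# -'*' | LEN(1): the asterisk match carries multiplicity -1 and cancels LEN(1).
not_star = alt(neg(lit("*")), length(1))
# LEN(1) | -'*': alternation is commutative, so the order does not matter.
not_star_swapped = alt(length(1), neg(lit("*")))

matches = {c: not_star("a*b", c) for c in range(4)}
print(matches)
```

At cursor 1, where the next character is the asterisk, the two alternates produce the same post-cursor position with multiplicities +1 and -1, so the counted set is empty and the pattern fails there.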

li.4  ^2  I2Z2222 

We remark that, given an arbitrary pattern, P, -P
is the "additive" (alternation) inverse of P. If a
"multiplicative" (concatenation) inverse of P (P ≠ FAIL), say
P⁻¹, could be defined such that P & P⁻¹ = P⁻¹ & P = NULL,
then patterns under | and & would form a division ring. (We
cannot hope to further extend the ring definition to obtain
a field since & is not commutative.) Recall, however, that
(P & P⁻¹)(S,c) = P⁻¹(S,P(S,c)), independently of the context
and the definition of P⁻¹. If P fails to match in S, i.e.
P(S,c) = ∅, then P⁻¹(S,P(S,c)) = P⁻¹(S,∅) = ∅ ≠ NULL(S,c).
Thus there does not exist a definition of P⁻¹ that will
yield a division ring.

A modified concept of inverse patterns under & is
nonetheless an interesting and useful one, and this section
is devoted to the introduction of such a concept.

To develop a theory of inverse patterns, we start
by noting that negative patterns are an important special
case of a more general concept. Recall the definition of
-P:

(-P)(S,c) = (-1)*P(S,c)

where (S,c) is an arbitrary context. Parenthesizing the
expression on the right side of this equation in a slightly
different way, we get

(-P)(S,c) = (-1*P)(S,c)

This immediately suggests the more general

Definition: Let P be a pattern and i be any integer. Then
i*P is a pattern, and is defined by

(i*P)(S,c) = i*(P(S,c))

where (S,c) is an arbitrary context.


We observe that if P and Q are patterns and i and j
are integers, then

(i) i*(P | Q) = i*P | i*Q

(ii) (i + j)*P = i*P | j*P

(iii) i*(j*P) = (ij)*P

(iv) 1*P = P


From observations (i) - (iv), we conclude that
patterns form a module over the integers. Since there are
infinitely many primitive patterns, including all strings,
this module is not finitely generated. We can, however,
derive a useful definition of semi-invertibility based on
linear dependence in the module. The entity we will define
is not a true inverse, but it retains most of the flavour
and usefulness of the stronger, but unattainable, concept.

First we note that in the module of patterns, the
following lemma holds:

Lemma: If P ≠ FAIL, then i*P = FAIL implies that i = 0.

Proof: If P ≠ FAIL there exists at least one context (S,c)
for which P(S,c) ≠ ∅. Therefore, if i ≠ 0, then (i*P)(S,c)
= i*(P(S,c)) ≠ ∅. Q.E.D.

Definition: Let P and Q be patterns. Then P and Q are
conformable, written P □ Q, if and only if for each context
(S,c) there exist integers i and j, not both zero, such that

i*(P(S,c)) = j*(Q(S,c))

or, equivalently,

(i*P | -j*Q)(S,c) = FAIL(S,c) = ∅

That is, two patterns are conformable if and only
if for each context in which they both succeed, the
multiplicities of the strings matched by one pattern in that
context are a constant multiple of the corresponding
multiplicities of the strings matched by the other.
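The definition of conformability can be tested context by context. The sketch below assumes counted sets are dicts from cursor position to integer multiplicity; the function conformable_at and the sample patterns are illustrative and not taken from the thesis. It relies on the observation that if either counted set is empty, the pair (i,j) = (1,0) or (0,1) satisfies the definition, and otherwise the multiplicities must be proportional over a common set of positions.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def lit(t):      # primitive string pattern
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

def length(n):   # analogue of LEN(n)
    return lambda s, c: {c + n: 1} if c + n <= len(s) else {}

def alt(p, q): return lambda s, c: add(p(s, c), q(s, c))

def conformable_at(x, y):
    """True if i*x == j*y for some integers i, j that are not both zero."""
    if not x or not y:
        return True            # choose the zero coefficient for the non-empty side
    if x.keys() != y.keys():
        return False
    k0 = next(iter(x))
    # Cross-multiplication tests proportionality without division.
    return all(x[k] * y[k0] == y[k] * x[k0] for k in x)

CONTEXTS = [(s, c) for s in ("AB", "AT", "BA", "") for c in range(len(s) + 1)]

def conformable(p, q):
    """Conformability tested on the sample contexts only."""
    return all(conformable_at(p(s, c), q(s, c)) for s, c in CONTEXTS)

doubled = conformable(lit("A"), alt(lit("A"), lit("A")))      # Q = 2*P
string_vs_len = conformable(lit("AB"), length(2))             # string vs LEN(|P|)
# A contrasting pair whose multiplicities are not proportional at ('AT', 0).
uneven = conformable(alt(lit("A"), lit("AT")),
                     alt(lit("A"), alt(lit("AT"), lit("AT"))))
print(doubled, string_vs_len, uneven)
```

The third pair is included only as a contrast: its two counted sets at ('AT',0) have the same positions but multiplicities {1, 1} and {1, 2}, so no single pair (i,j) works.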

Examples: 1. Let P = 'A'
                 Q = 'A' | 'A'

             Then Q = 2*P, and hence P and Q are
             conformable.

          2. Let P be any string
                 Q = LEN(|P|)

             Then, for any context (S,c), either

                 P(S,c) = Q(S,c)
             or  P(S,c) = 0*Q(S,c)

             and hence P and Q are conformable.


3.5 Semi-invertible Patterns

Definition: A pattern, P, is said to be semi-invertible if
there exists a pattern, P⁻¹, called a semi-inverse of P,
with the properties that for all contexts (S,c),

1. (P & P⁻¹)(S,c) = ∅ only if P(S,c) = ∅,

   (P⁻¹ & P)(S,c) = ∅ only if P⁻¹(S,c) = ∅,

2. P & P⁻¹ □ NULL,

   P⁻¹ & P □ NULL.

Note that the restriction P ≠ FAIL, which was
required to define a division ring, is not needed here if we
define FAIL⁻¹ by

FAIL⁻¹(S,c) = FAIL(S,c) = ∅

for arbitrary contexts (S,c).

To  avoid  heavily  parenthesized  expressions,  we 
arbitrarily  assign  precedences  to  the  operators  defined  thus 
far,  as  follows: 

highest 

I 

&  T 

I  lowest 

We remark that semi-inverse patterns can be used
(when they exist) to define context sensitive patterns, that
is, patterns that match successfully only in the context of
certain specified strings. Doyle [1973] gives an example
that is naturally described by a context sensitive pattern:
A sentence is recognized as being an arbitrary string of
characters including a terminating period, which is followed
by at least two blanks.

A SNOBOL4 pattern to match sentences is obscure and
complex:

ARB '.  ' @X FAIL | TAB(*(X - 2))

The concept of semi-inverse patterns yields
equivalent but much simpler and more transparent patterns:

   ARB & '.  ' & '  '⁻¹

or ARB & '.  ' & LEN(2)⁻¹

The pattern component '.  ' & '  '⁻¹ matches a period in the
right context of two blanks, and fails otherwise.

Doyle's solution is conceptually more complex,
involving the addition of a state variable to the scanner to
govern the direction of scanning in the subject string, and
an extension to the definition of the LEN primitive:

ARB '.  ' LEN(-2)

Having shown that semi-inverse patterns are a
potentially useful concept, the questions of immediate
interest concern whether semi-inverse patterns do exist,
and, if indeed they do exist, how they are characterized.

Unfortunately, it is not possible to define
semi-inverses in such a way that all patterns will be
semi-invertible, as the following example illustrates:

Example: Let P = 'TA' | 'A'

             S = 'TA'

Then P(S,0) = {2}, by matching 'TA', and
     P(S,1) = {2}, by matching 'A'.

Suppose P is semi-invertible. Then, as a result
of the first defining property of semi-inverse
patterns, our choice of contexts ensures that both
(P & P⁻¹)(S,0) and (P & P⁻¹)(S,1) will be non-null.
That is, there exist non-zero integers, i
and j, such that

(P & P⁻¹)(S,1) = i*(NULL(S,1))
               = {i*1}, and

(P & P⁻¹)(S,0) = j*(NULL(S,0))
               = {j*0}
               ≠ (P & P⁻¹)(S,1)

Evaluating now the left sides of these equations,
we have

(P & P⁻¹)(S,0) = P⁻¹(S,P(S,0))
               = P⁻¹(S,2)
               = P⁻¹(S,P(S,1))
               = (P & P⁻¹)(S,1)

contradicting the semi-invertibility of P.
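The heart of this counterexample is that P yields the same counted set in two different contexts, so whatever follows P cannot tell them apart. The sketch below illustrates this with a few arbitrary candidate "inverses"; it assumes counted sets are dicts from cursor position to multiplicity, and the candidate functions are hypothetical stand-ins chosen only for illustration.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def scale(i, x):
    """Multiply every multiplicity by the integer i."""
    return {pos: i * m for pos, m in x.items() if i * m != 0}

def extend(p, s, x):
    """Apply pattern p to a counted set of cursors, linearly."""
    out = {}
    for pos, m in x.items():
        out = add(out, scale(m, p(s, pos)))
    return out

def lit(t):
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

def alt(p, q): return lambda s, c: add(p(s, c), q(s, c))
def cat(p, q): return lambda s, c: extend(q, s, p(s, c))

P = alt(lit("TA"), lit("A"))
S = "TA"

# Hypothetical candidates for a semi-inverse of P; none is claimed to be one.
back1 = lambda s, c: {c - 1: 1}
back2 = lambda s, c: {c - 2: 1}
candidates = [back1, back2, alt(back1, back2), lambda s, c: {c: 1}]

# Each candidate sees only the counted set {2}, whichever context P started from.
indistinguishable = all(cat(P, inv)(S, 0) == cat(P, inv)(S, 1)
                        for inv in candidates)

# Yet semi-invertibility would demand i*{0} at context 0 and j*{1} at context 1,
# which never coincide for non-zero i and j.
nonzero = [k for k in range(-3, 4) if k != 0]
targets_differ = all(scale(i, {0: 1}) != scale(j, {1: 1})
                     for i in nonzero for j in nonzero)
print(P(S, 0), P(S, 1), indistinguishable, targets_differ)
```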

Note that this result is independent of whatever
definition we may choose for P⁻¹. In general, if a pattern,
P, matches successfully in a given context, the result of
applying P & P⁻¹ in that context will contain the pre-cursor
position (possibly with multiplicity > 1), and, if P is not
semi-invertible, can contain other post-cursor positions
(possibly with multiplicities > 1). Further, as the
preceding example illustrates, it is not possible in
general to deduce which of the post-cursor positions was in
fact the pre-cursor position.

We  have  shown  that  if  P  and  Q  are  semi-invertible 
patterns,  it  is  not  in  general  true  that  P  |  Q  is  a  semi- 
invertible  pattern.  We  remark  that  P  |  Q  will,  however,  be 
semi-invertible  if  and  only  if  P  does  not  match  any  prefix 
or  suffix  substring  of  the  strings  matched  by  Q,  and  vice 
versa.  In  these  cases,  P  |  Q  is  equal  in  each  context  to 
exactly  one  of  P  or  Q,  and  hence  semi-invertibility  of  P  and 
Q  guarantees  semi-invertibility  of  P  |  Q. 


Although  alternation  tends  to  destroy  semi- 
invertibility,  we  can  prove  that  semi-invertibility  is 
preserved  under  concatenation.  First,  we  prove  a  lemma: 

Lemma: Let P and Q be patterns such that P □ Q, and let R
be an arbitrary pattern. Then

    (i)  P & R □ Q & R
and (ii) R & P □ R & Q

Proof: We will prove only (i) since the proof of (ii) is
quite similar.

If P □ Q, then for each context (S,c) there exist
integers i and j, not both zero, such that

i*(P(S,c)) = j*(Q(S,c)).

Thus,

R(S,i*(P(S,c)))   = R(S,j*(Q(S,c)))
i*(R(S,P(S,c)))   = j*(R(S,Q(S,c)))
i*((P & R)(S,c))  = j*((Q & R)(S,c))
P & R             □ Q & R


Using this result, we can easily prove the
following theorem:

Theorem: If P and Q are semi-invertible patterns, then
P & Q is a semi-invertible pattern, and Q⁻¹ & P⁻¹ is a
semi-inverse of P & Q.

Proof: This result is easily ascertained by considering

(P & Q) & (Q⁻¹ & P⁻¹) = P & (Q & Q⁻¹) & P⁻¹
                      □ P & NULL & P⁻¹
                      = P & P⁻¹
                      □ NULL

Similarly, (Q⁻¹ & P⁻¹) & (P & Q) □ NULL.

Q.E.D.

The  negative  results  of  this  section 
notwithstanding,  there  is  a  large  class  of  patterns  for 
which  semi-inverse  patterns  do  exist,  and  are  useful.  In  the 
next  section  we  give  a  definition  of  semi-inverse  patterns 
motivated  by  the  examples  given  above,  and  proceed  to 
identify the class of semi-invertible patterns.


3.6 Semi-inverse Patterns

The essential characteristic of inverse patterns is
their ability to effectively reverse the direction of
pattern matching. This can be done most simply for string
patterns by reversing the pattern and applying it in the
usual way to the reverse of the subject string. The
extension to more complex patterns presents few
difficulties, and hence we give the following definition:

Definition: For every pattern, P, we state the existence of
a pattern, reverse(P), and define its standard semi-inverse
pattern, P⁻¹, by

P⁻¹(S,c) = |S| - reverse(P)(reverse(S), |S| - c)

where (S,c) is an arbitrary context.
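The definition of the standard semi-inverse translates almost directly into code. The following sketch assumes counted sets are dicts from cursor position to multiplicity, and treats only string patterns and a LEN-like primitive, whose reverses are the reversed string and the primitive itself. The names inverse_from_reverse and lit_inverse, and the sample text, are hypothetical.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def scale(i, x):
    """Multiply every multiplicity by the integer i."""
    return {pos: i * m for pos, m in x.items() if i * m != 0}

def extend(p, s, x):
    """Apply pattern p to a counted set of cursors, linearly."""
    out = {}
    for pos, m in x.items():
        out = add(out, scale(m, p(s, pos)))
    return out

def lit(t):
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

def length(n):
    return lambda s, c: {c + n: 1} if c + n <= len(s) else {}

def cat(p, q): return lambda s, c: extend(q, s, p(s, c))

def inverse_from_reverse(rev_p):
    """Build P^-1 from reverse(P): |S| - reverse(P)(reverse(S), |S| - c)."""
    def p_inv(s, c):
        hits = rev_p(s[::-1], len(s) - c)
        # Subtract each resulting position from |S|, keeping its multiplicity.
        return {len(s) - pos: m for pos, m in hits.items()}
    return p_inv

def lit_inverse(t):
    """Standard semi-inverse of the string pattern t (its reverse is t reversed)."""
    return inverse_from_reverse(lit(t[::-1]))

# '.  ' & '  '^-1 : a period in the right context of two blanks.
period_then_blanks = cat(lit(".  "), lit_inverse("  "))
# The same component written with LEN(2)^-1; LEN is taken to be self-reverse.
period_then_len = cat(lit(".  "), inverse_from_reverse(length(2)))

text = "Hi.  Ok. No"
results = {c: period_then_blanks(text, c) for c in (2, 7)}
print(results)
```

At cursor 2 the period is followed by two blanks, so the match succeeds and leaves the cursor just after the period; at cursor 7 only one blank follows and the match fails.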

Although, as we have seen, P⁻¹ is an abuse of
notation, we will hereafter use the symbol "⁻¹" consistently
in the sense defined here. It has the advantages of being
mnemonic and accurate in a restricted sense. That is, we
will later prove that if P is a semi-invertible pattern,
then any semi-inverse of P is equivalent to the standard
semi-inverse in the sense of the following definition:

Definition: Two patterns, P and Q, are said to be
match-equivalent if and only if for all contexts (S,c) and
cursor positions, d, d ∈ P(S,c) if and only if d ∈ Q(S,c).

Informally, two patterns are match-equivalent if
and only if they match the same set of strings in every
context.

Example: Let P = 'A' | 'A' | 'AT'

             Q = 'A' | 'AT' | 'AT'

Then P and Q always match the same strings, e.g.,

P('AT',0) = {2*1, 2}

Q('AT',0) = {1, 2*2}

and hence P and Q are match-equivalent.
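Match-equivalence ignores multiplicities and compares only which post-cursor positions occur. As a small sketch (assuming counted sets are dicts from position to multiplicity; the helper names are illustrative), the example can be evaluated as follows.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def lit(t):
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

def alt(p, q): return lambda s, c: add(p(s, c), q(s, c))

def match_equivalent_at(x, y):
    """Same post-cursor positions, regardless of multiplicity."""
    return set(x) == set(y)

P = alt(lit("A"), alt(lit("A"), lit("AT")))
Q = alt(lit("A"), alt(lit("AT"), lit("AT")))

p_result = P("AT", 0)   # position 1 twice, position 2 once
q_result = Q("AT", 0)   # position 1 once, position 2 twice
print(p_result, q_result, match_equivalent_at(p_result, q_result))
```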

In a pragmatic sense, two patterns that are
match-equivalent differ only trivially, since the
multiplicities of successful matches are often of little or
no interest. Put another way, the relevant question seems to
be: what were the successfully matched substrings? The
number of times each match occurred is generally a question
of secondary importance. SNOBOL4 goes even further, by
halting the pattern matching operation after the first
successful match.

Doyle [1973] provides a similar capability for
SNOBOL4 by introducing a state variable into the scanner
governing the direction of scan (pattern matching). A new
built-in function, SCAN, returns a directive (primitive
pattern) that permits the direction of scanning to be
changed dynamically during pattern matching, and restored
automatically during backtracking. In addition, certain
primitives permit the cursor to be moved "backwards" without
changing the direction of scan. This approach has the
disadvantage that it adds considerable complexity to the
pattern matching process by introducing the direction of
scan as an additional parameter to each pattern component
(or, equivalently, as an added piece of context information,
supplementing the sectioned subject string). Further, the
built-in function SCAN lacks generality in that its
definition (or the definition of an equivalent facility)
would be awkward in any pattern matching facility not
defined in terms of a "scanner" procedure. Note that the
model we have defined so far is entirely independent of the
existence of such a procedure. (Nor does the model preclude
the definition of a "scanner" in the design of a language
based on the model, should such a procedure be desired.)

Our definition of P⁻¹ is incomplete, for we have
not yet defined reverse(P), for arbitrary patterns, P.
reverse(S) is, however, well-known for strings, and hence
for primitive string patterns:

Definition: Let S = s1...sn be a string, where si denotes
the i-th character in S. Then

reverse(S) = sn...s1

The implications of this definition, and the
definitions of the alternation and concatenation operators,
are that reverse(P) should, for all patterns, P, have the
following axiomatic properties:

(i) reverse(P | Q) = reverse(P) | reverse(Q)

(ii) reverse(P & Q) = reverse(Q) & reverse(P)

(iii) reverse(reverse(P)) = P
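One way to make these properties concrete is to treat patterns as expression trees and define reversal by structural recursion. The sketch below is an assumption about representation rather than anything prescribed by the text: the classes Lit, Alt and Cat are hypothetical, and only string primitives are covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    text: str           # primitive string pattern

@dataclass(frozen=True)
class Alt:
    left: object        # P | Q
    right: object

@dataclass(frozen=True)
class Cat:
    left: object        # P & Q
    right: object

def reverse(p):
    """Structural reversal: strings are reversed, & swaps its operands."""
    if isinstance(p, Lit):
        return Lit(p.text[::-1])
    if isinstance(p, Alt):
        return Alt(reverse(p.left), reverse(p.right))     # property (i)
    if isinstance(p, Cat):
        return Cat(reverse(p.right), reverse(p.left))     # property (ii)
    raise TypeError(f"unsupported pattern node: {p!r}")

P = Cat(Lit("AB"), Alt(Lit("C"), Lit("DE")))
print(reverse(P))
```

Because the dataclasses compare structurally, property (iii) can be checked with an ordinary equality test on the trees.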

If  we  define  a  mapping  P  ->  reverse (P)  from  the  ring  of 
patterns  onto  itself,  properties  (i)  and  (ii)  state  that 
reversal  is  an  anti-automorphism  [Jacobson  1951; 
Redei  1967],  and  property  (iii)  states  that  reversal  is  a 
self-inverse  mapping.  Given  these  properties,  we  can  easily 
prove  the  following  lemma: 

Lemma: Let P be an arbitrary pattern and let i be an
integer. Then

(iv) reverse(FAIL) = FAIL

(v) reverse(-P) = -reverse(P)

(vi) reverse(i*P) = i*reverse(P)

Proof: (iv) and (v) are elementary results from the theory
of group homomorphisms. To prove (iv), consider

reverse(P) = reverse(P | FAIL)
           = reverse(P) | reverse(FAIL)

Thus, reverse(FAIL) = FAIL. Using this result, we have

FAIL = reverse(FAIL)
     = reverse(P | -P)
     = reverse(P) | reverse(-P)

and hence -reverse(P) = reverse(-P), proving (v). To prove
(vi), consider first the case where i > 0. Then

reverse(i*P) = reverse(P | P | ... | P)  (i repetitions)
             = reverse(P) | reverse(P) | ... | reverse(P)
             = i*reverse(P)

If i = 0, property (iv) applies, and if i < 0, write

i*P = -(|i|*P)

and apply property (v) to the above result.

Q.E.D.

An  immediate  consequence  of  property  (vi)  (which 
includes  properties  (iv)  and  (v)  as  special  cases)  is  the 
following  lemma,  which  we  state  without  proof: 

Lemma: Let P and Q be patterns such that P □ Q. Then
reverse(P) □ reverse(Q).

If P is semi-invertible, and P⁻¹ is a semi-inverse
of P (we have yet to prove this), then

NULL = reverse(NULL)
     □ reverse(P⁻¹ & P)
     = reverse(P) & reverse(P⁻¹)

Similarly,

NULL □ reverse(P⁻¹) & reverse(P).

Since reverse(P) & reverse(P⁻¹) clearly fails in any
context if and only if reverse(P) fails, and
reverse(P⁻¹) & reverse(P) fails if and only if reverse(P⁻¹)
fails, we conclude that reverse(P⁻¹) is a semi-inverse of
reverse(P), and hence is match-equivalent to (reverse(P))⁻¹.
We are otherwise unconstrained in defining reverse(P⁻¹), and
therefore give the obvious definition:

Definition: Let P be any pattern. Then

reverse(P⁻¹) = (reverse(P))⁻¹.

If there are primitive patterns other than strings
under consideration, their reverses must be individually
defined. Some primitive patterns, such as the SNOBOL4
primitive, LEN, are self-reverse patterns. Some, however,
are not. reverse(BAL), for example, is a pattern that
matches any non-null string of characters that is balanced
with respect to right and left parentheses. The reverses of
other primitive patterns can often be determined by
modelling the primitive pattern using other primitives for
which reverses are known, and the operations of alternation
and concatenation. Some examples using the primitive
patterns in SNOBOL4 will illustrate the technique:

Examples: 1. ARB can be modelled as

    NULL | LEN(1) | LEN(2) | ...

Hence,

    reverse(ARB) = reverse(NULL | LEN(1) | LEN(2) | ...)
                 = reverse(NULL) | reverse(LEN(1)) | ...
                 = NULL | LEN(1) | LEN(2) | ...
                 = ARB

Similarly,

    reverse(ARBNO(P)) = ARBNO(reverse(P))

2. If S = s1...sn is any string, where si denotes
the i-th character in S, then ANY(S) can be
modelled as

    s1 | s2 | ... | sn

Hence,

    reverse(ANY(S)) = reverse(s1 | s2 | ... | sn)
                    = reverse(s1) | ... | reverse(sn)
                    = s1 | s2 | ... | sn
                    = ANY(S)

Similarly,

    reverse(NOTANY(S)) = NOTANY(S)

3. If we write

    BREAK(S) = ARBNO(NOTANY(S)) & ANY(S) & ANY(S)⁻¹

then

    reverse(BREAK(S)) = reverse(ARBNO(NOTANY(S)) & ANY(S) &
                                ANY(S)⁻¹)
                      = reverse(ANY(S)⁻¹) & reverse(ANY(S)) &
                        reverse(ARBNO(NOTANY(S)))
                      = ANY(S)⁻¹ & ANY(S) & ARBNO(NOTANY(S))

Thus, reverse(BREAK(S)) matches all strings not
containing break characters, and which have a
break character as immediate left context.

4. If we model SPAN(S) as

    ANY(S) & ARBNO(ANY(S)) & NOTANY(S) & NOTANY(S)⁻¹

then, following the example of BREAK, we see
that reverse(SPAN(S)) matches, in the immediate
left context of any character not appearing in
its argument, all strings consisting solely of
characters appearing in its argument:

    reverse(SPAN(S)) =
        NOTANY(S)⁻¹ & NOTANY(S) & ARBNO(ANY(S)) & ANY(S)


Returning to the definition of P⁻¹, it follows from
the definition of reverse(P) that the mapping P -> P⁻¹ is
also an anti-automorphism of patterns. Accordingly,

1. (P | Q)⁻¹ = (P)⁻¹ | (Q)⁻¹

2. (P & Q)⁻¹ = (Q)⁻¹ & (P)⁻¹

It can also be shown that

3. FAIL⁻¹ = FAIL

4. (-P)⁻¹ = -(P⁻¹)

5. (P⁻¹)⁻¹ = P


We close this section with the proof that if R is a
semi-invertible pattern, then all semi-inverses of R are
match-equivalent.

Lemma: Let R be a semi-invertible pattern, and let P be a
semi-inverse of R. Then for all contexts (S,c), d ∈ P(S,c)
implies R(S,d) ≠ ∅.

Proof: Suppose there exists a context (S,c) such that
d ∈ P(S,c) and R(S,d) = ∅. Since (P & R)(S,c) = ∅ only if
P(S,c) = ∅, there must exist some other cursor position
e ∈ P(S,c) and non-zero integer i such that

R(S,e) = {i*c}.

Now,

(R & P)(S,e) = P(S,R(S,e))
             = P(S,{i*c})
             = i*(P(S,c))
             ⊇ {d}

contradicting the assertion that P is a semi-inverse of R.


Theorem: Let R be a semi-invertible pattern, and let P and
Q be semi-inverses of R. Then for all contexts (S,c),
d ∈ P(S,c) if and only if d ∈ Q(S,c).

Proof: Suppose there exists a context (S,c) such that
d ∈ P(S,c) and d ∉ Q(S,c). Since P is a semi-inverse of R,
there exists (by the previous lemma) a non-zero integer, i,
such that

R(S,d) = {i*c}.

Now,

(R & Q)(S,d) = Q(S,R(S,d))
             = Q(S,{i*c})
             = i*(Q(S,c))

Since d ∉ Q(S,c), this contradicts the assertion that Q is
a semi-inverse of R. Interchanging P and Q in the above
argument completes the proof.


3.7 Unambiguous Patterns

We can now turn to the problem of characterizing
semi-invertible patterns and proving that if P is a
semi-invertible pattern, then the standard semi-inverse, P⁻¹,
actually is a semi-inverse of P. The following definition
gives a simple sufficient condition for semi-invertibility.

Definition: A pattern, P, is unambiguous if and only if for
all contexts (S,c), the cardinality of P(S,c) is at most
one.

Theorem: If P and reverse(P) are unambiguous, then P is
semi-invertible, and P⁻¹ is a semi-inverse of P (and hence
every semi-inverse of P is match-equivalent to P⁻¹).

Proof: Let (S,c) be an arbitrary context. Two cases arise
in showing that P & P⁻¹ □ NULL:

case 1: P(S,c) = ∅. In this case the result holds
        trivially, since

        (P & P⁻¹)(S,c) = 0*NULL(S,c) = ∅

case 2: P(S,c) ≠ ∅. Then P(S,c) = k*{c+n}, for some
        non-zero integer, k, and integer, n. Also, since only
        the reverses of those pattern alternates that
        matched successfully in the forward direction can
        match successfully in the reverse direction, we
        have reverse(P)(reverse(S),|S|-c-n) = k*{|S|-c}.
        Hence,

        (P & P⁻¹)(S,c) = P⁻¹(S,P(S,c))
                       = k*(P⁻¹(S,c+n))
                       = k*(|S| - reverse(P)(reverse(S),|S|-c-n))
                       = k*(|S| - k*{|S| - c})
                       = k²*{c}
                       = k²*NULL(S,c)

and so P & P⁻¹ □ NULL. The proof that P⁻¹ & P □ NULL is
analogous.

Q.E.D.
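The factor k² in case 2 can be observed on a concrete unambiguous pattern. The sketch below takes P = k*'AB' for an integer k, so that reverse(P) = k*'BA' by property (vi). It assumes counted sets are dicts from cursor position to multiplicity; the helper scaled_lit and the sample subject string are hypothetical.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def scale(i, x):
    """Multiply every multiplicity by the integer i."""
    return {pos: i * m for pos, m in x.items() if i * m != 0}

def extend(p, s, x):
    """Apply pattern p to a counted set of cursors, linearly."""
    out = {}
    for pos, m in x.items():
        out = add(out, scale(m, p(s, pos)))
    return out

def cat(p, q): return lambda s, c: extend(q, s, p(s, c))

def scaled_lit(k, t):
    """Return (k*t, reverse(k*t)) as a pair of pattern functions."""
    fwd = lambda s, c: {c + len(t): k} if s.startswith(t, c) else {}
    rev = lambda s, c: {c + len(t): k} if s.startswith(t[::-1], c) else {}
    return fwd, rev

def inverse_from_reverse(rev_p):
    """Standard semi-inverse: |S| - reverse(P)(reverse(S), |S| - c)."""
    def p_inv(s, c):
        hits = rev_p(s[::-1], len(s) - c)
        return {len(s) - pos: m for pos, m in hits.items()}
    return p_inv

def round_trip(k, s, c):
    """Evaluate (P & P^-1)(S,c) for P = k*'AB'."""
    p, rev_p = scaled_lit(k, "AB")
    return cat(p, inverse_from_reverse(rev_p))(s, c)

subject = "xABy"
success_2 = round_trip(2, subject, 1)     # P matches at cursor 1
success_m3 = round_trip(-3, subject, 1)   # negative k still gives k squared
failure = round_trip(2, subject, 0)       # P fails at cursor 0
print(success_2, success_m3, failure)
```

In both successful cases the result is the pre-cursor position with multiplicity k², a non-zero multiple of NULL(S,c), and in the failing case it is empty, matching the two cases of the proof.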


3.8 Recursively Defined Patterns

Many  naturally  occurring  patterns  have  recursive 
definitions.  In  particular,  patterns  resembling  context 
free  grammars  can  be  used  to  recognize  context  free 
languages. 

Example: Consider the following context free grammar for a
subclass of constant arithmetic expressions:

E = T | E '+' T ;

T = F | T '*' F ;

F = '0' | '1' | ... | '9' | '(' E ')' ;

In SNOBOL4, a pattern to recognize (match) such
expressions can be written as

F = ANY('0123456789') | '(' *E ')'

T = F | *T '*' F

E = T | *E '+' T

In general, recursive patterns are those defined by
equations of the form:

P = g(P)

where g is an arbitrary pattern-producing function of
patterns. Thus, if we define the domain of patterns by the
domain equation

Pattern = S x D(S) -> D(S),

where patterns map strings and counted sets into counted
sets, then g is an element of the domain

Pattern -> Pattern.

In this section we give a mathematical semantics for such
patterns using least fixed point techniques.

In chapter 2 we excluded the use of negation in
recursive pattern definitions on the grounds that negation
(or multiplication by any negative integer) is not a
continuous operation. As a result, we can restrict our
attention in this section to counted sets containing only
elements with non-negative multiplicities. We will denote
the sub-domain of D(S) defined by this restriction by D(S)⁺.
Note that in D(S)⁺, bot = ∅, since for all x ∈ D(S)⁺, ∅ ≤ x.

Patterns are extended to the domain S x D(S)⁺ as
follows:

Definition: Let P (≠ FAIL) be any pattern and (S,c) be an
arbitrary context. Then the following table defines the
values of P on S x D(S)⁺:

+------------------+------------------+------------------+
| P(top,bot) = bot | P(top,c) = top   | P(top,top) = top |
+------------------+------------------+------------------+
| P(S,bot) = bot   | P(S,c)           | P(S,top) = top   |
+------------------+------------------+------------------+
| P(bot,bot) = bot | P(bot,c) = bot   | P(bot,top) = top |
+------------------+------------------+------------------+

We define FAIL in the extended domain by

FAIL: S x D(S)⁺ -> bot.

Note  that  under  this  extension  any  pattern  defined 
in  terms  of  primitive  patterns  and  continuous  pattern-valued 
functions  (i.e.,  all  functions  defined  in  the  model,  with 
the  exception  of  multiplication  by  a  negative  integer)  is  a 
continuous  function. 

Returning now to the problem of giving meaning to
patterns defined by equations of the form

P = g(P),

where P and g are continuous, we can write g using the
extended definition of patterns and the notation

f = (D: x) D': body

to define a function f: D -> D'.

Example: Let P = 'A' | P & 'A'

then P = g(P), where

g = (Pattern: Q) Pattern:
    'A' | Q & 'A'

P is thus a fixed point of g, and we can solve the
equation P = g(P) for its least fixed point using

P = lub{g⁰(FAIL), g¹(FAIL), g²(FAIL), ...}

We produce the sequence of domain elements

g⁰(FAIL), g¹(FAIL), g²(FAIL), ...

where, for arbitrary (S,c) ∈ S x D(S)⁺,

g⁰(FAIL)(S,c) ≤ g¹(FAIL)(S,c) ≤ g²(FAIL)(S,c) ≤ ...

and form the limit

lub{g⁰(FAIL), g¹(FAIL), g²(FAIL), ...}

which is the least fixed point of g. A table of some of the
gⁿ for various values of n and contexts (S,c) will show that
the least fixed point definition of the pattern in the
previous example does indeed match any sequence of one or
more 'A's:


 (S,c)     | g⁰(FAIL) | g¹(FAIL) | g²(FAIL) | g³(FAIL) | g⁴(FAIL) | P
-----------+----------+----------+----------+----------+----------+---------
 (bot,0)   | ∅        | ∅        | ∅        | ∅        | ∅        | ∅
    .      |          |          |          |          |          |
    .      |          |          |          |          |          |
 ('AAA',0) | ∅        | {1}      | {1,2}    | {1,2,3}  | {1,2,3}  | {1,2,3}
 ('AAA',1) | ∅        | {2}      | {2,3}    | {2,3}    | {2,3}    | {2,3}
 ('AAA',2) | ∅        | {3}      | {3}      | {3}      | {3}      | {3}
 ('AAA',3) | ∅        | ∅        | ∅        | ∅        | ∅        | ∅
    .      |          |          |          |          |          |
    .      |          |          |          |          |          |
 (top,top) | ∅        | top      | top      | top      | top      | top
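The approximations gⁿ(FAIL) for the pattern P = 'A' | P & 'A' can be computed by direct iteration. The following sketch assumes counted sets are dicts from cursor position to multiplicity and handles ordinary strings only; the top and bot elements of the extended domain are left out. The names g and iterate are illustrative.

```python
def add(x, y):
    """Counted-set sum; zero multiplicities are dropped."""
    out = dict(x)
    for pos, m in y.items():
        out[pos] = out.get(pos, 0) + m
    return {pos: m for pos, m in out.items() if m != 0}

def scale(i, x):
    """Multiply every multiplicity by the integer i."""
    return {pos: i * m for pos, m in x.items() if i * m != 0}

def extend(p, s, x):
    """Apply pattern p to a counted set of cursors, linearly."""
    out = {}
    for pos, m in x.items():
        out = add(out, scale(m, p(s, pos)))
    return out

def lit(t):
    return lambda s, c: {c + len(t): 1} if s.startswith(t, c) else {}

FAIL = lambda s, c: {}

def alt(p, q): return lambda s, c: add(p(s, c), q(s, c))
def cat(p, q): return lambda s, c: extend(q, s, p(s, c))

def g(q):
    """The functional of the example: g(Q) = 'A' | Q & 'A'."""
    return alt(lit("A"), cat(q, lit("A")))

def iterate(n):
    """Return the n-fold application of g to FAIL."""
    p = FAIL
    for _ in range(n):
        p = g(p)
    return p

# One row per cursor position in 'AAA'; one column per n = 0..4.
rows = {c: [sorted(iterate(n)("AAA", c)) for n in range(5)] for c in range(4)}
for c, row in rows.items():
    print(c, row)
```

For the subject string 'AAA' the sequence stabilizes by n = 3, since no match can extend past the end of the string; longer subjects would need correspondingly more iterations.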

The  following  points  summarize  the  main  features  of 
the  algebraic  model  developed  in  this  chapter: 

1.  Patterns  are  defined  as  counted  set  producing 
functions  of  contexts  or  counted  sets  of  contexts. 

2. Negation, alternation and concatenation are
pattern-valued functions of patterns, and are
defined in terms of operations on counted sets.

3. Patterns, under the operations of alternation and
concatenation, form a ring with unit element. The
additive identity is FAIL, and the multiplicative
identity is NULL.

4.  The  algebra  cannot  be  further  extended  to  define 
a  field  or  even  a  division  ring.  We  can  show, 
however,  that  patterns  form  a  module  over  the 
integers. 

5.  Semi-inverse  patterns  are  defined  in  terms  of 
conformity,  a  notion  based  on  linear  dependence  in 
the  module. 

6. Not all patterns are semi-invertible. Further,
if P and Q are semi-invertible patterns, then P & Q
is semi-invertible but P | Q need not be.

7. For an arbitrary pattern, P, a definition of its
standard semi-inverse, P⁻¹, can be formulated in
terms of reverse(P), and it is shown that any
semi-inverse of P is match-equivalent to P⁻¹. A method
for defining P⁻¹ for an arbitrary pattern, P, is
outlined.

8. Unambiguity is defined, and is shown to be a
simple sufficient condition for semi-invertibility.
9.  The  semantics  of  recursive  pattern  definitions 
can  be  defined  using  least  fixed  point  techniques. 

The  development  in  this  chapter  has  been 
exploratory  and  abstract,  with  only  occasional  references  to 
programming  applications  in  order  to  suggest  interpretations 
or  applications  of  algebraic  features.  The  ramifications  of 
the  model  in  terms  of  string  manipulation  language  design 
remain  essentially  unexplored,  and  it  is  therefore  the 
purpose  of  the  next  chapter  to  facilitate  evaluation  of  the 
model  by  embedding  it  in  the  environment  of  a  programming 
language.


4. A Language for Pattern-Directed String Manipulation


4.1 Introduction

The  model  of  patterns  and  pattern  matching 
described  in  the  preceding  chapter  has  been  designed  to  be 
independent  of  considerations  of  a  supporting  language.  By 
choosing  an  algebraic  rather  than  an  interpretive  model,  we 
have  been  able  to  separate  those  issues  inherent  to  the  data 
structures  and  operations  of  pattern  matching  from  those 
concerned  with  its  efficient  implementation.  Considerations 
of  the  latter  type  include  pattern  matching  heuristics  and 
backtracking.  Further,  the  algebraic  model  allows  the 
language  designer  to  retain  the  liberty  to  incorporate  the 
data  and  control  structures  best  suited  to  meeting  specific 
language  design  goals.  In  contrast,  Markov-algorithm-based 
languages  such  as  SNOBOL  and  COMIT  have  control  structures 
that  are  strongly  influenced  by  the  concept  of  expression 
and  statement  continuations  [Tennent  1973],  which  are  in  a 
sense  the  result  of  pattern  matching  in  these  languages. 

Two approaches to the problem of putting the
pattern matching model into the environment of a programming
language seem viable. The first is to embed the model into
an existing programming language. This approach has the
advantage of providing a familiar framework against which
the model can be evaluated with a minimum of bias spawned by
environment. The principal difficulties that arise concern
the interactions of new features with existing data and
control concepts of the host language. Tennent [1973] has
shown that by using mathematical semantics to model both the
new features and the language into which they are to be
embedded, the semantics of the resulting dialect can be well
understood. He has illustrated these techniques by
incorporating SNOBOL4 patterns and pattern matching into
QUEST [Fenner et al. 1973].

A  second  approach,  adopted  here,  is  to  design  a  new 
language, incorporating pattern matching as a central
feature.  This  will  allow  us  to  address  more  directly 
problems  concerned  with  the  development  of  techniques  for 
string  manipulation  language  design.  Our  goal  will  not  be 
to  design  a  complete  language,  but  rather  to  explore  the 
interactions  of  the  model  with  some  relevant  aspects  of 
language  design.  To  do  so,  we  will  partially  develop  a 
language  designed  to  be  a  useful  aid  in  understanding  the 
processes  of  pattern-directed  string  manipulation,  and  hence 
in  the  eventual  design  of  subsequent  higher  level 
applications  languages. 

The language is intended to be a string
manipulation analog of the Machine Oriented Higher Level
Languages of systems programming [Clark and Horning 1973;
van der Poel 1974]. That is, the language will be "low
level" in terms of the pattern matching model, providing
counted sets as a basic aggregate type of the language and
treating patterns just as they are defined in the model. To
complement the low level aspects of the language, emphasis
will be placed on providing convenient tools for abstraction
and modularization in the form of functional abstractions
(procedures) and abstract data types (clusters). The
reasons for designing such a language are twofold. First,
we want the mapping from algebraic model to programming
language to be a simple one, since it is difficult to divine
how the model can most effectively be presented to its
users. By implementing the model at a low level and
providing mechanisms for abstraction, much of the language
design process becomes the responsibility of the user. This
is the philosophy expressed in [McKeeman 1966] and has been
used successfully with SIMULA 67 [Dahl et al, 1972]. By
studying well-written programs in a language such as that
proposed here, it should be possible for language designers
to base decisions on which operations should be made
primitive in higher level pattern-based string manipulation
languages on a sound, unbiased empirical basis.

The remarks of the previous paragraph apply equally
to the problems in string manipulation language design
concerned with the selection of language features that
support, but are not central to, the processes of string
manipulation. Here we are concerned with the choice of
primitive data types, primitive functions (such as the
SNOBOL4 functions DUPL, REPLACE, SIZE, etc.), the form and
extent of input and output facilities, and so on. Further,
since no language designer can foresee all the abstractions
that will be required by the users of his language, and
since experience has shown that it is desirable to design
languages that are sufficiently small that they can be fully
mastered by their users, it is worthwhile to include
abstraction facilities such as those described below even in
very high level languages.

Language  design  decisions  can  be  isolated  by 
partitioning  the  language  into  distinct  levels,  as  suggested 
by  McKeeman  [1973].  We  will  group  abstractions  and 
declaratives  into  the  highest  level  of  language,  and  design 
these facilities first. At this level, the terminal symbols
of the grammar will include "Language_type" (the built in
scalar and aggregate types of the language), "Statement",
and  "Expression",  and  all  design  decisions  will  be 
independent  of  the  fact  that  the  language  will  ultimately 
include  string  manipulation  primitives  and  pattern  matching. 
The  design  goals  at  this  level  are  minimality,  simplicity 
and  uniformity  of  design  and  expression,  separate 
compilation  of  abstractions,  both  procedural  and  typal,  and 
efficiency  of  execution. 

The language will be strongly typed. Strong typing
minimizes the design complications arising from interactions
of language features, thus simplifying the overall language
design and improving program understandability, reliability,
and efficiency. We feel that languages such as SNOBOL4 both
benefit and suffer from the absence of strong typing.
Dynamic  typing  gives  the  programmer  a  great  deal  of 
flexibility  in  writing  programs  that  are  both  concise  and 
powerful.  It  is  at  the  same  time  a  common  source  of  errors 
and  inefficiencies.  By  incorporating  strong  typing  in  this 
language,  it  is  hoped  that  the  ramifications  of  relaxing 
typing  restrictions  in  string  manipulation  language  designs 
can  be  better  understood,  resulting  in  languages 
characterized  by  both  flexibility  and  reliability. 

4.2 Programs

4.2.1 Grammar

Program = Abstraction_definition '_|_' ;

Abstraction_definition = Procedure_definition |
                         Cluster_definition ;

Procedure_definition = Identifier '=' 'procedure'
                       ['(' Formal_parameter_list ')']
                       ['returns' Type_specification]
                       Procedure_body ;

Formal_parameter = Identifier ':' Type_specification |
                   Identifier '=' Procedure_specification ;

Type_specification = Declaration_type | Subrange | '*' ;

Declaration_type = Language_type |
                   Identifier ['(' Expression_list ')']
                   ['of' Type_specification_list] ;

Subrange = Expression ['..' Expression] ;

Procedure_specification = 'procedure'
                          ['(' Specification_list ')']
                          ['returns' Type_specification] ;

Specification = Type_specification | Procedure_specification ;

Procedure_body = (Definition)* (Variables_declaration)*
                 (Statement)* 'end' ;

Definition = Abstraction_definition |
             Identifier '=' Type_specification ;

Variables_declaration = Identifier_list ':'
                        Type_specification ['initial' Expression] ;

Cluster_definition = Identifier '=' 'cluster'
                     ['(' Formal_parameter_list ')']
                     ['of' Identifier_list] Record_body
                     Procedure_body (Operation_declaration)+ ;

Record_body = (Variables_declaration)*
              ['case' Identifier ':' Type_specification
               'of' ((Label ':')+
                     '(' (Variables_declaration)* ')')+
               ['otherwise' ':'
                '(' (Variables_declaration)* ')']] ;

Label = Constant | Subrange ;

Operation_declaration = Identifier ':' 'operation'
                        ['(' Formal_parameter_list ')']
                        ['returns' Type_specification]
                        Procedure_body ;

4.2.2 Compilation Units


The  compilation  unit  of  the  language  is  the 
definition  of  either  a  cluster  (abstract  data  type)  or  a 
procedure (abstract operation). Compilation is facilitated
by  prohibiting  reference  to  global  variables  in  procedure 
and  cluster  definitions;  all  communication  must  take  place 
via  parameter  lists,  and,  in  the  case  of  procedures, 
returned  values.  Although  not  a  necessary  restriction,  the 
task  of  compile-time  type  checking  parameters  and  results 
can  be  eased  by  requiring  that  the  definition  of  each 
abstraction  used  in  a  program  be  placed  physically  before 
its  use.  For  simplicity,  ALGOL  scope  rules  determine  the 
scope of definition names.

A set of compiler directives could be designed to
store  and  retrieve  compiled  definitions  in  object  module 
form  from  a  data  base  established  for  the  purpose,  allowing 
programs  to  share  a  common  library  of  definitions.  It  is 
this  mechanism  that  would  allow  the  language  to  effectively 
evolve  as  useful  abstractions  are  recognized  and  programmed. 

4.2.3 Definitions

Each definition consists of an identifier naming
the abstraction to be defined, followed by an equality
symbol, followed either by a type specification, in which case
the identifier names a type or a subrange of a type, or by one of
the reserved words "procedure" or "cluster", identifying the
abstraction as a procedure or abstract data type. The term
"cluster" is due to Liskov and Zilles [1974], and reflects
the view that an abstract object is completely characterized
by the set of operations that can be performed on it.
Although the value of abstract data types is widely
recognized [Liskov and Zilles 1974; Mealy 1967], mechanisms
for defining data abstractions are not yet well understood,
as evidenced by the large body of ongoing research in the
area. The scheme presented here is very primitive, and has
been chosen primarily to meet the goals of simplicity of
expression and efficiency of implementation; it is a variant
of the scheme presented in [Liskov and Zilles 1974], and has
been included primarily to indicate the direction in which
we feel such language facilities should evolve.

4.2.3.1 Cluster Definitions


Example:

Stack = cluster(Maxsize: Integer) of Element_type
    Top: 0..Maxsize
    Stk: Array(1..Maxsize) of Element_type

    Top := 0

    Push: operation(V: Element_type)
        if Top = Maxsize then Error
        Top := Top + 1
        Stk(Top) := V
    end

    Pop: operation returns Element_type
        if Top = 0 then Error
        Top := Top - 1
        return Stk(Top + 1)
    end

    Erase_top: operation
        if Top = 0 then Error
        Top := Top - 1
    end

    Empty: operation returns Boolean
        return Top = 0
    end

    Full: operation returns Boolean
        return Top = Maxsize
    end
end


An  abstract  type  definition  consists  of 


1. A representation of the abstract type as a record
body. Other abstract data types may be used in declaring
the record fields of a representation, but recursive use of
abstract data types is not permitted. This restriction
significantly improves opportunities for execution
efficiency by permitting the allocation of a fixed amount of
storage (exclusive of that used for representing sequences
and sets).

2.  A  procedure  body  that  is  executed  when  an  abstract 
data  type  is  used  to  declare  a  variable.  Typically,  this 
procedure  will  initialize  the  representation. 

3.  A  set  of  characterizing  operations  of  the  data 
type,  chosen  either  axiomatically  or  pragmatically. 
Operations  are  defined  by  their  implementation  as  a  special 
form  of  procedure. 

The parenthesized parameters of a cluster
definition permit the programmer to parameterize various
dimensions of the type, such as the maximum stack size in the
example above. These parameters can also be used to
initialize type components, and so on. Rarely should a
cluster have more than two or three parameters of this sort,
so a positional notation has been adopted for its advantages
of conciseness and simplicity. Further, the intended nature
of these parameters is such that read-only reference is the
most appropriate and efficient parameter passing mechanism,
and is therefore the only parameter mechanism for clusters.

The optional clause "'of' Identifier_list" permits
the programmer to define types that are composed of other
types; just as arrays are composed of elements of a type
specified in the array declaration, the stack in the above
example is composed of elements of a type specified in the
stack declaration. Since "type" is not a type of the
language, these are not parameters in the usual sense, and
the syntax of the clause emphasizes this distinction.
Whenever a variable of abstract type is declared, the actual
types in the declaration are substituted in the cluster
definition for purposes of storage allocation and type
checking.


The  representation  of  the  abstract  type  is  given  as 
the  body  of  a  record  declaration,  including  possibly  a 
variant  part.  (Records  are  discussed  in  detail  in  a  later 
section.)  The  names  of  the  variables  comprising  the 
representation  are  local  to  the  cluster,  and  can  be  accessed 
only  by  the  initialization  procedure  and  the  operations  of 
the  data  type.  In  the  example  above,  the  stack  is 
implemented  as  an  array  of  the  element  type  of  the  stack, 
and is identified locally as Stk. The upper bound of the
array is given by the parameter Maxsize, and a top of stack
pointer, Top, is declared to index the array. A different
representation of the stack, as a string instead of an
array, might be appropriate if the stack sizes were expected
to vary widely, or if no appropriate estimate of the maximum
stack depth could be made when a stack is declared. The
important point is that such decisions are purely
representational, and can be localized within the cluster
definition.

The initialization procedure following the
representation of the abstract type consists of a procedure
body, which can contain local definitions and declarations,
and which can reference the representation and cluster
parameters. In the stack example above, the initialization
consists only of the statement "Top := 0". (This could
equally well have been achieved by initializing Top in its
declaration.)

Note  that  the  syntax  of  operations  is  that  of 
declarations,  not  definitions.  This  reflects  the  fact  that 
when  a  variable  is  declared  with  abstract  type,  the 
operations  of  that  type  are  declared  to  be  attributes  of  the 
variable.  For  example,  if  the  variable  S  is  declared  as 

    S: Stack(100) of Integer

the "Push" operation becomes an attribute of S, and is
invoked by statements of the form

    S.Push(Value)
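
For readers who wish to experiment, the Stack cluster can be approximated in Python. This is a sketch under stated assumptions, not part of the proposed language: the exception StackError stands in for the unspecified "Error" action, the lower-case method names are our own, and Python neither enforces the element type nor the read-only status of Maxsize.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class StackError(Exception):
    """Hypothetical stand-in for the cluster's unspecified Error action."""


class Stack(Generic[T]):
    """Approximates 'Stack = cluster(Maxsize: Integer) of Element_type'."""

    def __init__(self, maxsize: int) -> None:
        # Representation, private by convention: Top and Stk.
        self._maxsize = maxsize                      # cluster parameter
        self._stk: List[Optional[T]] = [None] * maxsize
        self._top = 0                                # initialization: Top := 0

    def push(self, v: T) -> None:
        if self._top == self._maxsize:
            raise StackError("stack is full")
        self._top += 1
        self._stk[self._top - 1] = v                 # Stk is 1-origin in the text

    def pop(self) -> Optional[T]:
        if self._top == 0:
            raise StackError("stack is empty")
        self._top -= 1
        return self._stk[self._top]                  # Stk(Top + 1) in the text

    def erase_top(self) -> None:
        if self._top == 0:
            raise StackError("stack is empty")
        self._top -= 1

    def empty(self) -> bool:
        return self._top == 0

    def full(self) -> bool:
        return self._top == self._maxsize


s: Stack[int] = Stack(100)   # S: Stack(100) of Integer
s.push(5)                    # S.Push(5)
```

As in the cluster, callers see only the operations; replacing the list by some other representation would not change any code that uses the class.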

As  the  grammar  and  example  illustrate,  the  syntax 
of  operations  after  the  reserved  word  "operation"  is  exactly 
that  of  procedures.  Unlike  procedures,  however,  operations 
may  refer  to  non-local  variables,  but  only  to  the  cluster 
parameters  and  those  declared  as  the  representation  of  the 
abstract  data  type.  (It  is  for  this  reason  that  operations 
cannot be separately compiled.)

4.2.3.2 Procedure Definitions

The  abstraction  mechanism  for  encapsulating 
subalgorithms  within  a  program  is  the  procedure.  Like 
operations,  procedures  may  optionally  be  parameterized 
and/or  return  a  value  to  the  calling  program.  Unlike 
operations,  procedures  may  not  access  global  variables;  all 
information  transfer  must  take  place  through  parameter  lists 
and  returned  values.  This  eliminates  the  disadvantages  of 
programming  with  global  variables  [Wulf  and  Shaw  1973], 
encourages  narrow  program  interfaces,  and  facilitates 
separate  compilation,  at  (presumably)  little  loss  of 
programming  power.  This  is  a  consequence  of  the  separation, 
following  Liskov  and  Zilles,  of  the  concepts  of  "functional 
abstraction"  and  "characterizing  operation".  Any  procedure 
that  accesses  a  global  variable  is  conceptually  an  operation 
(on that variable), or can be decomposed into a procedure
and  an  operation,  and  should  therefore  be  implemented  as 
such. 

The  parameter  specification  mechanism  of  the 
language  allows  the  passing  of  both  data  and  procedure  names 
as  parameters  to  procedure  and  cluster  definitions.  The 
definition  and  use  of  parametric  procedures  is  facilitated 
by  the  elimination  of  global  variables,  although  strong 
typing  requires  that  parameter  definitions  for  parametric 
procedures  contain  a  template  specifying  the  parameter  and 
result types of the parametric procedure.

To permit the efficient passing of large data
objects  as  parameters  to  procedures  and  operations,  and  to 
minimize  the  number  of  concepts  in  the  language,  we  will 
specify  reference  as  the  only  form  of  parameter  passing  to 
procedures  in  the  language.  Actual  parameters  that  are 
constants  or  expression  values,  and  all  cluster  parameters, 
are  restricted  to  be  read-only.  Although  the  argument  for 
positional  notation  with  procedure  parameters  is  less  strong 
than  that  for  cluster  or  operation  parameters,  consistency 
and  minimality  of  constructs  dictate  that  the  same  form 
should  be  used  in  all  three  cases. 

Formal  parameter  declarations  need  specify  only 
enough  type  information  to  allow  all  uses  of  the  parameters 
to  be  type  checked  at  compilation.  Thus,  parameter  and 
component  types  of  abstract  types  may  be  specified  by 
asterisks  where  appropriate,  indicating  to  the  compiler  that 
the  type  specification  it  replaces  will  not  be  referenced  in 
compiling  the  procedure  or  cluster  body  that  follows. 

Examples:

P1 = procedure(S: Stack(*) of Integer)
    S.Push(5)
end

P2 = procedure(S: Stack(*) of *) returns Boolean
    return S.Empty
end

4.2.4 Declarations

Variables  are  declared  by  the  occurrence  of  an 
identifier  (or  identifier  list  if  more  than  one  variable  is 
being  declared  of  the  same  type),  followed  by  a  colon, 
followed  by  a  type,  or,  in  the  case  of  those  types  which  are 
ordered  and  for  which  constants  can  be  expressed  in  the 
language,  a  subrange  of  a  type.  A  variable  may  take  values 
only  of  the  type  for  which  it  is  declared,  and  variables 
must  be  declared  before  use.  In  the  case  of  variables 
declared  by  means  of  subranges,  the  type  of  such  variables 
is  the  type  of  the  expressions  delimiting  the  subrange,  and 
the set of values which such variables may take is given by
the subrange. Note that the specification of subranges
consisting of a single value can be abbreviated. Further,
constant declarations and definitions can be used to name
constant valued expressions. Variables can be initialized
in their declarations by the optional clause "'initial'
Expression". All variables not explicitly initialized are
automatically initialized to an undefined value, which is
represented in program texts by the reserved word
"Undefined". Variables may be tested for equality with
Undefined, but it is otherwise an error to use an undefined
value  in  the  evaluation  of  an  expression.  Variable  types 
can  be  specified  as  primitive  scalar  or  aggregate  types  of 
the  language,  by  type  definition,  or  by  name  (parameterized 
if necessary) if the type has been previously defined.

4.3 Basic Types and Operations

4.3.1 Grammar

Language_type = 'Character' |
                'Boolean' |
                'Integer' |
                '{' Identifier_list '}' |
                '[' Identifier_list ']' |
                'Array' '(' Subrange ')' 'of' Type |
                'String' ['of' Type] |
                ['Counted'] 'Set' 'of' Type |
                'Record' Record_body 'end' ;

4.3.2 Types and Operations on Types

Our  goal  at  this  next  level  of  the  language  design 
is  to  select  a  small  set  of  scalar  and  aggregate  types  that 
are in some sense basic to the construction of higher level
abstract  data  types,  and,  more  importantly,  basic  to  the 
problem  area  for  which  the  language  is  designed. 

Associated with each type is a set of valid
operations on that type. We feel that values of a type
should be assignable to any variable of that type and
testable for equality with other values of the same type.
This is the case with the scalar types of the language, but
to avoid belabouring the non-string-manipulation aspects of
the language, we have not included the (substantial)
additional mechanisms required to define these operations on
aggregate primitive and abstract types. Zilles [1974] has
suggested an approach to the problems involved. Other
operations are listed in the type descriptions below.

We propose only four scalar types for the language.
These are:

1.  Characters,  from  an  implementation-defined 
character  set.  Character  constants  are  written  as  a 
character  enclosed  by  quotation  marks. 

Examples:

    "g"
    " "
    """

Characters  are  an  ordered  data  type,  and  hence  the 
comparison  predicates  are  defined.  For  simplicity,  we  leave 
the  specification  of  order  to  the  implementation.  If 
transportability  were  a  design  goal  of  the  language, 
mechanisms  could  be  provided  for  the  specification  of  both 
the  character  set  and  its  ordering  in  an  implementation- 
independent  fashion. 

2. Booleans. The constants of this type are "True"
and "False". Logical operations include conjunction,
disjunction,  and  negation.  All  of  the  predicate  operations 
in  the  language  return  Boolean  results. 

3.  Integers,  in  an  implementation-defined  range. 
Since  the  principal  uses  envisioned  for  integers  are 
indexing  rather  than  number-theoretic  applications,  it  is  to 
our  advantage  to  sacrifice  the  portability  of  some  programs 
for the efficiencies obtainable by restricting ourselves to
integers in the range handled directly by the hardware of
the  machines  on  which  programs  are  being  run.  Since  the 
ordering  of  integers  is  not  implementation-dependent,  the 
programmer  can  shelter  programs  that  are  to  be  transported 
by  specifying  subranges  rather  than  "Integer"  whenever 
possible  in  variable  declarations.  Specifying  subranges 
where  appropriate  also  has  the  advantage  of  leaving  the 
compiler  the  freedom  to  optimize  representations. 

Integer  arithmetic  operations  include  negation, 
addition,  subtraction,  multiplication  and  integer  division. 

4.  User-defined  scalars.  These  provide  a  useful  and 
easily-implemented  primitive  abstraction  mechanism.  Such 
scalars  are  typically  represented  in  languages  without  this 
feature  by  strings  or  integers.  These  mechanisms  tend  to 
obscure  meaning  and  reduce  reliability,  justifying  an 
alternative  language  feature.  Accordingly,  we  represent 
user-defined  scalar  types  as  sequences  or  sets  of 
identifiers,  representing  the  values  of  the  type.  Two 
classic  examples  are: 

    Colour = [Red, Orange, Yellow, Green, Blue, Violet]
    Direction = {North, South, East, West}

We  distinguish  ordered  from  unordered  user-defined  scalar 
types  by  using  sequence  rather  than  set  delimiters. 

The  comparison  predicates  (other  than  equality  and 
inequality)  are  defined  only  on  ordered  types. 
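
The distinction between ordered and unordered user-defined scalars can be illustrated in Python. This is only a sketch: the base class OrderedScalar is an assumption of ours, and the ordering is taken to be the order of declaration, as the sequence delimiters suggest.

```python
from enum import Enum


class OrderedScalar(Enum):
    """Hypothetical base for scalars declared with sequence delimiters [ ... ]."""

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented

    def __le__(self, other):
        if self.__class__ is other.__class__:
            return self.value <= other.value
        return NotImplemented

    def __gt__(self, other):
        if self.__class__ is other.__class__:
            return self.value > other.value
        return NotImplemented

    def __ge__(self, other):
        if self.__class__ is other.__class__:
            return self.value >= other.value
        return NotImplemented


class Colour(OrderedScalar):
    # Colour = [Red, Orange, Yellow, Green, Blue, Violet]
    RED = 1
    ORANGE = 2
    YELLOW = 3
    GREEN = 4
    BLUE = 5
    VIOLET = 6


class Direction(Enum):
    # Direction = {North, South, East, West}: equality and inequality only
    NORTH = "North"
    SOUTH = "South"
    EAST = "East"
    WEST = "West"
```

Comparing two Colour values with "<" is permitted, whereas the same comparison on two Direction values is rejected by Python with a TypeError, which mirrors the rule that the comparison predicates are defined only on ordered types.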


The aggregate types of the language are:

1. Arrays of arbitrary type. Arrays are useful data
aggregates in their own right, and are a valuable
representational tool in the definition of abstract data
types. For simplicity, we define only one-dimensional
arrays indexed by scalar types, but array elements may be of
arbitrary type, including arrays. The lower and upper array
bounds are specified by a subrange argument to the array
declaration.

Element  selection  is  the  only  array  operation  in 
the  language. 

2. Strings of arbitrary type. Like arrays, the
elements of a string are homogeneous with respect to type,
and can be referenced by indexing the string by an
integer-valued expression. Unlike arrays, strings are of dynamic
length, up to an implementation-defined maximum. The length
of a string can be ascertained at any time by means of the
primitive function "Length". Indexing in strings is always
zero origin.

Although we have identified strings with sequences
of values of an arbitrary homogeneous type, the most
commonly occurring string type will naturally be "String of
Character". For this reason, we allow a single default
specification within the language; if the base type of a
string type is not specified, "Character" will be assumed.
Thus, "String of Character" may be abbreviated to "String".

Single characters (elements) of a string can be
indexed directly, and the "Substring" operation and
pseudo-variable can be used to access and replace substrings. The
only infix operation, other than equality, provided for
strings is concatenation. Since there are a number of
different ways to define string comparison predicates, we
leave these to be user-defined.

Example: If S = 'DISHWASHER', then S(5) = 'A',
Substring(S, 4, 4) = 'WASH', and the statement

    Substring(S, 0, 4) := 'CLOTHES'

results in S having the value 'CLOTHESWASHER'.
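
The example can be checked with a short Python sketch. We assume, as the example suggests, that the arguments of "Substring" are a zero-origin starting index followed by a length. Python strings are immutable, so the pseudo-variable assignment is modelled by a hypothetical function assign_substring that returns the new string value.

```python
def substring(s: str, start: int, length: int) -> str:
    """Return the `length` elements of s beginning at zero-origin index `start`."""
    if start < 0 or length < 0 or start + length > len(s):
        raise IndexError("substring out of range")
    return s[start:start + length]


def assign_substring(s: str, start: int, length: int, replacement: str) -> str:
    """Model 'Substring(S, start, length) := replacement' by returning the new S."""
    if start < 0 or length < 0 or start + length > len(s):
        raise IndexError("substring out of range")
    return s[:start] + replacement + s[start + length:]


S = "DISHWASHER"
fifth = S[5]                                   # S(5)
wash = substring(S, 4, 4)                      # Substring(S, 4, 4)
S = assign_substring(S, 0, 4, "CLOTHES")       # Substring(S, 0, 4) := 'CLOTHES'
```

Note that the replacement need not have the same length as the substring it replaces, which is why strings must be of dynamic length.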

A  string  constant  is  written  as  a  list,  enclosed  by 
square  brackets,  of  constants  of  the  type  comprising  the 
string.  We  allow  an  abbreviated  notation  for  the  special 
case  of  strings  of  characters;  such  strings  may  be 
represented  as  contiguous  sequences  of  characters,  enclosed 
by  apostrophes.  Apostrophe  characters  within  such  a  string 
are represented by double apostrophes.

Examples:

    [5, 7, -32]
    ["A", "B", "C"] = 'ABC'
    'X'
    [ ]
    ''
    'HELGA''S'

Note that the null string, when represented by "[ ]", is
automatically coerced to be of the appropriate type for the
context in which it is used.

A  number  of  different  storage  management  schemes 
are  available  for  representing  strings.  The  mechanisms  we 
have  implied,  because  they  impact  least  upon  the  definition 
and  use  of  strings  in  a  language  designed  primarily  for 
their manipulation, are descriptor-based dynamic storage
allocation  and  regeneration  mechanisms.  Using  these 
mechanisms,  the  programmer  can,  in  general,  be  unconcerned 
with  the  maximum  string  length,  which  can  typically  be  very 
large  in  such  systems. 

3. Sets and counted sets of arbitrary type. Constants
of set types are represented by lists of constants of the
set component type, and are enclosed by brace brackets.
Elements of counted sets are optionally preceded by the
notation "Expression '*'", denoting the integer multiplicity
of each element. If the multiplicity is omitted, a
multiplicity of one is assumed. Set constants are coerced
to counted set constants if context so demands.

Examples:

    {5, 1, -32}
    {"A", "B", "C"}
    {1*7, -4*2, 6}
    {5*'S', 2*["T", "U"]}
    { }

Sets and counted sets can contain zero or more
elements, up to an implementation-defined maximum. The
issues regarding the representation of sets and counted
sets,  and  the  impact  of  representation  on  the  language,  are 
analogous  to  those  connected  with  strings,  and  we  therefore 
propose  that  similar  representational  mechanisms  be  adopted. 


The cardinality of a set or counted set can be
ascertained by means of a primitive cardinality function.
The language also requires operations to test for membership
in a (counted) set, to add an element to a (counted) set, to
delete an element from a (counted) set, and to select an
arbitrary element from a (counted) set. The last-mentioned
operation might be written "'Any' '(' Expression ')'", where
"Expression" is (counted) set-valued. If its argument is
the empty set, "Any" returns the value "Undefined". The
language should contain an integer-valued primitive
operation to ascertain the multiplicity of a counted set
element. Although they are not strictly primitive, we propose that the
operations to compute the product of a counted set and an
integer, the negation of a counted set, the additive
union of two counted sets, and the sum or difference of a
counted set of integers and an integer, be made available as
language primitives, since they must be implemented in any
event to support pattern matching.
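
The counted set operations listed above can be sketched in Python as follows. The class and method names are assumptions made for the illustration, elements of multiplicity zero are taken not to be members, Python's None stands in for "Undefined", and cardinality is omitted because its treatment of negative multiplicities is not fixed by this section.

```python
from typing import Dict, Hashable, Optional


class CountedSet:
    """Sketch of a counted set: elements paired with nonzero integer multiplicities."""

    def __init__(self, counts: Optional[Dict[Hashable, int]] = None) -> None:
        # Elements whose multiplicity is zero are simply absent.
        self._counts = {e: n for e, n in (counts or {}).items() if n != 0}

    def __eq__(self, other: object) -> bool:
        return isinstance(other, CountedSet) and self._counts == other._counts

    def __contains__(self, element: Hashable) -> bool:
        return element in self._counts

    def multiplicity(self, element: Hashable) -> int:
        return self._counts.get(element, 0)

    def any(self) -> Optional[Hashable]:
        """Select an arbitrary element; None plays the role of Undefined."""
        return next(iter(self._counts), None)

    def scale(self, k: int) -> "CountedSet":
        """Product of a counted set and an integer."""
        return CountedSet({e: k * n for e, n in self._counts.items()})

    def negate(self) -> "CountedSet":
        return self.scale(-1)

    def additive_union(self, other: "CountedSet") -> "CountedSet":
        """Add multiplicities element by element."""
        merged = dict(self._counts)
        for e, n in other._counts.items():
            merged[e] = merged.get(e, 0) + n
        return CountedSet(merged)

    def shift(self, k: int) -> "CountedSet":
        """Sum of a counted set of integers and an integer (use -k for difference)."""
        return CountedSet({e + k: n for e, n in self._counts.items()})


a = CountedSet({7: 1, 2: -4, 6: 1})   # the constant {1*7, -4*2, 6}
```

With this representation the additive union of a counted set and its negation is the empty counted set.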

4.  Record  types.  Because  of  their  central  importance 
in  the  definition  of  type  representations  in  cluster 
definitions,  a  flexible  record  definition  mechanism  is  an 
important  part  of  the  language.  The  mechanism  adopted  here 
is based on that in PASCAL [Wirth 1971]. In their simplest
form, records are simply sets of variables of possibly
different types. Component variables are accessed by the
name qualification mechanism of PL/I [IBM 1972]. In
general, a record consists of a fixed part containing a
sequence  of  variable  declarations,  followed  optionally  by  a 
variant  part.  The  variant  part  consists  of  a  tag  variable, 
declared  to  be  of  any  type  for  which  constants  can  be 
expressed  in  the  language,  followed  by  a  sequence  of  variant 
formats.  These  are  sequences  of  declarations  enclosed  by 
parentheses,  and  labelled  by  constants  or  subranges  of  the 
type  of  the  tag  variable.  The  value  of  the  tag  variable 
dynamically  determines  the  interpretation  of  the  data 
structure  by  determining  the  alternative  format  in  effect. 
An optional alternative labelled by the special label
"otherwise" is the alternative in effect if the tag variable
has  a  value  that  does  not  label  any  other  alternative.  If 
present,  the  alternative  labelled  "otherwise"  must  be  the 
last alternative of the variant part.

4.4 Patterns and Pattern Matching

4.4.1 Introduction

In  incorporating  patterns  into  the  language,  we  are 
faced  with  a  number  of  problems  arising  from  the  fact  that 
patterns  exhibit  some  properties  normally  associated  with 
data,  and  other  properties  normally  associated  with 
functions.  The  functional  characteristics  of  patterns  are 
dealt  with  at  length  in  the  preceding  chapter.  In 
particular,  the  operations  of  alternation,  concatenation, 
negation,  etc.,  are  operators  on  the  functional  domain  of 
patterns.  In  addition,  although  recursive  data  types  have 
been  excluded  from  our  language,  we  wish  to  be  able  to 
define  recursive  patterns,  our  example  being 

P = P & 'A' | 'A'

On  the  other  hand,  it  is  natural  to  identify 
primitive  string  patterns  with  the  strings  they  match,  and 
it is convenient to be able to declare variables to be of
type "Pattern" and dynamically make assignments to these
variables, as in, for example, the SNOBOL4 statement

P = P | INPUT

Further,  it  may  be  convenient  to  be  able  to  pass  patterns  as 
parameters  to  procedures,  and  to  have  procedures  return 
patterns  as  results. 

One solution, which is already excluded at this
stage of the design, is to identify procedures and data in
the language as a whole, in the manner of LISP. We feel
that this approach puts severe syntactic and semantic
restrictions on the language, and has little relevance to
the type of programs for which the language is designed. In
addition, no foreseeable implementation would meet the design
objective of efficient program execution.

A  second  approach  is  that  taken  by  SNOBOL4,  in 
which  patterns  are  a  data  type  with  functional  properties 
(although  the  syntax  of  pattern  matching  differs  from  that 
of function invocation). Recursive patterns must be defined
with the help of a special monadic operator, '*', which
defers the evaluation of its operand until it is encountered
as part of a pattern structure during pattern matching.
Thus, the recursive pattern that matches any sequence of one
or more 'A's would be written in SNOBOL4 as

P = *P 'A' | 'A'

During a pattern match, the value of P (which is
*P 'A' | 'A') is substituted for *P. This technique has the
disadvantage that, because alternates are matched from left
to right in SNOBOL4, a heuristic is required to break the
recursive  loop  in  the  definition  substitution  process. 

A third approach, which we propose, is to recognize
the singular characteristics of patterns by introducing them
into the language as entities distinct both from procedures
and data, but having some properties of each concept. In
this way, we can ensure that patterns have the properties
that we wish to give them, and no others.

4.4.2 Grammar

Definition = Pattern_definition ;

Pattern_definition = (Identifier | 'Reverse' '(' Identifier ')')
                     '=' Pattern ;

Pattern = Identifier ['(' Actual_parameters ')'] |
          'Pattern' '(' (Identifier | Expression) ')' |
          'Reverse' '(' Pattern ')' |
          Pattern '|' Pattern |
          Pattern '&' Pattern |
          Expression '*' Pattern |
          '-' Pattern |
          Pattern '⁻¹' ;

4.4.3 Patterns

Since the language incorporates strings of
arbitrary type, it is natural to define patterns to match
such strings. However, the language is strongly typed, so
strings cannot be identified directly with patterns. The
operator "Pattern" is therefore provided to perform
conversions  from  strings  to  patterns.  For  example,  the 
pattern  definition 

A1  =  Pattern  ('A') 

defines A1 to be a pattern matching the single character
string 'A'.

The mechanism already exists in the language to
define functions that are, in effect, patterns. For
example, the following procedure corresponds to a pattern
that matches the string 'A':

A2 = procedure (S: String, C: Integer)
        returns Counted Set of Integer

    if C < 0 or C > Length (S)
        then Error

    if C < Length (S)
        then if S (C) = "A"
            then return {C+1}
        end
    end

    return { }

end

Our recursive pattern to match one or more 'A's could then
be written as:

P2 = procedure (S: String, C: Integer)
        returns Counted Set of Integer

    if A2(S,C) = { }
        then return { }
        else return {C+1} + P2(S,C+1)
    end

end
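
For readers who wish to experiment, the two procedures can be transcribed into Python as a rough sketch. This is not the proposed language: we assume that collections.Counter may stand in for a counted set of integers, that cursor position c refers to the gap before the zero-indexed character s[c], and that "Error" can be modelled by raising ValueError.

```python
from collections import Counter


def a2(s: str, c: int) -> Counter:
    """Pattern procedure matching the single character 'A' at cursor c."""
    if c < 0 or c > len(s):
        raise ValueError("invalid context")  # stands in for "Error"
    if c < len(s) and s[c] == "A":
        return Counter({c + 1: 1})           # one post-cursor position
    return Counter()                         # the empty counted set


def p2(s: str, c: int) -> Counter:
    """Pattern procedure matching one or more 'A's starting at cursor c."""
    if not a2(s, c):
        return Counter()
    result = Counter({c + 1: 1})
    result.update(p2(s, c + 1))              # additive union with the recursive match
    return result
```

Applied to the subject 'AAAB' at cursor 0, p2 yields the post-cursor positions 1, 2 and 3, each with multiplicity one.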


The operator "Pattern" will accept the names of
such functions as arguments, when the following conventions
defining a pattern or pattern-valued procedure are observed:

1. The first parameter of the function must be
declared with the type specification
"'String' ['of' Type_specification]".

2. The second parameter of the function must be
declared with the type specification "Integer".

3. No restrictions are placed on the type
specifications of optional third and subsequent
parameters.

4. The function value must have type specification
"Counted Set of Integer".

If  F  is  such  a  function,  then  Pattern  (F)  will  be  a 
pattern  if  and  only  if  F  has  two  formal  parameters, 
specifying  a  string  context.  Otherwise,  Pattern (F)  is  a 
pattern-valued function of the remaining arguments of F. An
example  based  on  the  SNOBOL4  primitive  pattern-valued 
function,  LEN,  will  illustrate  the  latter  case: 

Example: 

F = procedure (S: String of *, C: Integer, N: Integer)
        returns Counted Set of Integer

    if C < 0 or C > Length (S) or N < 0
        then Error

    if C+N <= Length (S)
        then return {C+N}
        else return { }

end

Len = Pattern (F)

P = Len (2) | Len (4)

Len is defined as a monadic pattern-valued
procedure coinciding with F. P is subsequently
defined to be the alternation of two patterns,
Len (2) and Len (4).
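
The way in which "Pattern" turns a procedure with extra parameters into a pattern-valued function can be imitated with closures. The following Python sketch is illustrative only: the helper names pattern, alt and length are hypothetical, collections.Counter stands in for a counted set, and the extra arguments are bound by value here rather than by reference as in the proposed language. For simplicity, pattern always returns a pattern-valued function, so a two-parameter procedure would be used as pattern(g)().

```python
from collections import Counter


def pattern(f):
    """Turn f(s, c, *extra) into a pattern-valued function of the extra arguments."""
    def make(*extra):
        def matcher(s, c):
            return f(s, c, *extra)
        return matcher
    return make


def alt(p, q):
    """Alternation: the additive union of the two match results."""
    def matcher(s, c):
        result = Counter(p(s, c))
        result.update(q(s, c))
        return result
    return matcher


def f(s, c, n):
    """Counterpart of the procedure F: match any n elements after cursor c."""
    if c < 0 or c > len(s) or n < 0:
        raise ValueError("invalid context or length")
    return Counter({c + n: 1}) if c + n <= len(s) else Counter()


length = pattern(f)             # plays the role of Len = Pattern (F)
p = alt(length(2), length(4))   # plays the role of P = Len (2) | Len (4)
```

With the subject 'ABCDE' and cursor 1, p yields the post-cursor positions 3 and 5; with the subject 'ABC' and cursor 0, only position 2 is produced because four elements are not available.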

The importance of this mechanism is, first, that it
provides a simple and consistent means for permitting the
programmer to perform a wide variety of computations during
pattern matching, and second, that primitive pattern-valued
functions need not be specified as part of the language.
This is in keeping with the extensibility goal of the
language, since as useful pattern primitives are developed,
they can be stored as function definitions, using the file
mechanism proposed earlier. Note that if (Pattern(F))⁻¹ is
to be referenced, the reverse of F must be defined as well.
The reverse of any pattern can be defined by substituting
"'Reverse' '(' Identifier ')'" for "Identifier" on the left
side of the pattern definition. (Note that it is not
necessary to define a pattern before defining its reverse.)

All of the pattern operators defined in the model
are available in the language, with the same syntax. Thus,
just as the procedure A2 corresponds to the pattern A1 in
the example above, the pattern

P1 = P1 & Pattern('A') | Pattern('A')

corresponds to the procedure P2 given above. Specifically,

P1 = Pattern(P2)

The  advantages  of  treating  patterns  as  data  can  be 
achieved  by  allowing  patterns  to  be  used  as  type 
specifications  in  declarations.  That  is,  we  add  the 
production 

Type_specification = 'Pattern' ['of' Type_specification] ;

to the grammar. The optional qualifying type specifies the
type of string elements matched by patterns of that type.
If omitted, "Character" is assumed.

Example: The SNOBOL4 loop

        P = FAIL
LOOP    P = P | INPUT    :S(LOOP)

can be programmed in our language as follows:

P: Pattern initial Fail

do while ¬End_of_file (Input)
    P := P | Pattern (Input)
end


4.4.4 Pattern Matching

The syntax of pattern matching is expressed in the
production

Expression = (Pattern | Storage_reference) '(' Expression
             ',' Expression ')' ;

in which the actual parameters of the pattern have the types
"String" (of the same type as the pattern), and "Integer" or
"Counted Set of Integer", respectively. An error is
diagnosed if the arguments of the pattern match do not
represent valid contexts, or if a type mismatch between the
subject and pattern is detected during pattern matching. The
value of a pattern match is a counted set of integers.

Recall that all parameters in the language are
passed by reference. Thus, if M is an integer-valued
variable in the context of the example above defining the
pattern-valued function, Len, the statement

P := Len(M)

has the effect of binding M as the third argument of the
procedure F, which is invoked when P is used in a pattern
match. The effect of this mechanism is equivalent to the
SNOBOL4 statement

P  =  LEN(*M) 


The  operations  of  substring  replacement  and  value 
assignment  can  be  programmed  as  procedures  (and  filed  for 
use  as  primitive  functions,  using  the  proposed  separate 
compilation and file mechanisms). For example, a procedure
to  replace  the  first  occurrence  of  a  pattern  in  a  subject 
character  string  by  an  object  character  string  might  be 
programmed  as  follows: 


Replace = procedure (Subject: String, Patt: Pattern,
                     Object: String) returns Boolean

    Post_cursor: 0..Length (Subject)

    do each Pre_cursor: 0..Length (Subject)

        Post_cursor := Any (Patt (Subject, Pre_cursor))

        if Post_cursor ≠ Undefined
            then Substring (Subject, Pre_cursor,
                            Post_cursor - Pre_cursor) := Object
                 return True
        end

    end

    return False

end

This  function  models  closely  the  SNOBOL4  pattern 
matching  statement: 

SUBJECT PATT = OBJECT  :S(label) F(label)

The Boolean value returned by Replace can be used in
conditional statements to direct flow of control, modelling
the success and failure continuations of SNOBOL4. Unlike
the latter concept, which can be ambiguous as to why the
statement failed, the Boolean returned by Replace is a true
indication of the success or failure of the pattern match.
With suitable optimization of the expression

Any (Patt (Subject, Pre_cursor))

to match a single substring, the procedure "Replace" should
execute as efficiently as the SNOBOL4 statement.
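
The behaviour of the procedure "Replace" can be approximated in Python as follows. This is a sketch under stated assumptions, not the proposed language: Python strings are immutable, so the sketch returns the new subject together with the Boolean instead of assigning to a substring; literal and any_element are hypothetical helpers; collections.Counter stands in for a counted set; and None models "Undefined".

```python
from collections import Counter


def literal(text):
    """Hypothetical helper: a pattern matching the given string."""
    def matcher(s, c):
        if c < 0 or c > len(s):
            raise ValueError("invalid context")
        return Counter({c + len(text): 1}) if s.startswith(text, c) else Counter()
    return matcher


def any_element(counted):
    """Select an arbitrary element, or None (for "Undefined") if the set is empty."""
    return next(iter(counted), None)


def replace(subject, patt, obj):
    """Replace the leftmost matched substring; return (new subject, success)."""
    for pre_cursor in range(len(subject) + 1):
        post_cursor = any_element(patt(subject, pre_cursor))
        if post_cursor is not None:
            return subject[:pre_cursor] + obj + subject[post_cursor:], True
    return subject, False
```

For instance, replacing the first match of literal('ab') in 'xxabxxab' by 'Q' gives 'xxQxxab' and True, while a subject containing no match is returned unchanged with False.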

It  is  hoped  that  giving  users  the  freedom  to 
develop  their  own  pattern  matching  procedures  will  foster  a 
wide  variety  of  pattern  matching  mechanisms,  from  which  the 
most  commonly  used  can  be  selected  for  inclusion  as 
primitives  in  higher  level  string  manipulation  languages. 
For  example,  the  procedure  "Replace"  replaces  any  leftmost 
matched  substring  of  the  subject  string  by  the  object 
string.  Simple  modifications  would  permit  replacement  of 
rightmost matched substrings, multiple matched substrings,
and so on, as the needs of users dictate. Improvements over
the efficiency of such operations when programmed in
languages such as SNOBOL4 could be dramatic. For example,
one could easily write a procedure to perform the common
task of replacing all occurrences of a pattern by an object
string. Such a procedure would eliminate much of the
overhead that is required to perform this task either
iteratively or by backtracking in SNOBOL4.

As  another  example  of  the  flexibility  of  the 
proposed  pattern  matching  mechanism,  optimizations  that  are 
generalizations of that implicit in the SNOBOL4 definition
of alternation can also be programmed. The SNOBOL4
alternation operator is not commutative, with the result
that  the  programmer  can  list  pattern  alternates  in  order  of 
decreasing  probability  of  successfully  matching.  This 
effect  can  be  achieved  by  writing  pattern  matching 
procedures  with  several  pattern  arguments,  or  an  argument 
that  is  a  list  of  patterns.  These  patterns  can  then  be 
matched  against  the  subject  in  any  order  desired.  In  fact, 
decisions  connected  with  the  order  of  pattern  matching  can 
be  made  dynamically,  using  information  that  may  be  made 
available  only  during  program  execution.  One  further 
disadvantage of the SNOBOL4 alternation mechanism is that it
forces  the  programmer  to  specify  the  order  in  which  pattern 
alternates  are  to  be  matched,  even  when  this  optimization  is 
inappropriate,  violating  the  principle  that  the  semantics  of 
language  constructs  should  allow  the  programmer  to  express 
his  intentions,  and  nothing  more. 
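
A pattern matching procedure that takes a list of patterns and tries them in an order chosen at run time might look like the following Python sketch. The names literal and match_in_order, and the priority argument, are hypothetical. Unlike alternation in the model, this procedure deliberately stops at the first alternate that matches, which is the optimization the programmer is choosing to apply.

```python
from collections import Counter


def literal(text):
    """Hypothetical helper: a pattern matching the given string."""
    def matcher(s, c):
        return Counter({c + len(text): 1}) if s.startswith(text, c) else Counter()
    return matcher


def match_in_order(patterns, s, c, priority=None):
    """Try the patterns in the caller's order; return the first non-empty result."""
    ordered = sorted(patterns, key=priority) if priority else list(patterns)
    for p in ordered:
        result = p(s, c)
        if result:
            return result
    return Counter()
```

The priority function can consult information gathered during execution, such as how often each alternate has matched so far, so the order of matching need not be fixed when the patterns are written.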

Expressions  and  Statements 

Once  the  data  types  and  operations  of  the  language 
have  been  defined,  the  specification  of  how  operations 
combine  to  form  expressions,  including  details  of  precedence 
and  associativity,  and  the  selection  of  statement  constructs 
for  the  language  combine  to  form  a  separate  level  of 
language  design.  Since  the  issues  at  this  level  have  little 
bearing  on  the  string  manipulation  aspects  of  the  language, 
we  will  leave  this  phase  of  the  design  incomplete  by  not 
further refining in our grammar the non-terminals
"Expression" and "Statement". In general, however, the
expression  and  statement  mechanisms  suitable  for  a  general 
purpose  or  systems  implementation  language  will  be  suitable 
for  the  language  designed  here,  and  the  informal  syntax  used 
in  the  examples  has  been  chosen  to  reflect  this  fact.  Our 
examples  have  used  very  simple  conditional  and  iterative 
statements,  which  could  be  supplemented  by  other,  similar, 
constructs, or replaced by a few more general constructs,
such  as  those  proposed  by  Dijkstra  [1974]. 

Miscellaneous 

The  final  phase  of  the  language  design  we  have  been 
developing  involves  the  specification  of  facilities  for 
input  and  output,  including  file  definition  and  use, 
tracing,  and  error-handling.  Although  these  features  make 
an  important  contribution  to  the  utility  of  any  string 
manipulation  language  and  system,  and  therefore  deserve 
careful  study,  the  issues  involved,  like  those  of  the 
previous  section,  are  nearly  orthogonal  to  our  principal 
area  of  concern.  Thus,  we  will  restrict  ourselves  to  a  few 
brief  comments. 

The input and output facilities for such a language
must be string oriented, and should, in our opinion, be very
simple in design. The I/O facilities of SNOBOL4 meet both
these criteria very successfully, and might well serve as a
model for subsequent designs. In particular, it should be
redundant in a string manipulation language to design what
is, in effect, a sub-language for the specification of
formats, such as that found in FORTRAN and PL/I. (SNOBOL4
does allow the specification of FORTRAN formats, but we
suspect that these were included as a bonus of the
implementation, which uses a FORTRAN I/O package, and that
they are dispensable in most programs that use them.) The
SNOBOL4 handling of files is also attractively simple and
flexible.

Tracing facilities are a useful, or even
indispensable, feature of both very low and very high level
languages.  In  such  languages,  the  static  structure  of 
programs  may  be  very  complex,  or  the  static  program 
structure  may  not  reflect  very  closely  the  dynamic  structure 
of  an  executing  program.  In  both  cases,  a  dynamic  trace  can 
be  an  essential  de-bugging  tool.  In  languages  of  an 
intermediate  level,  such  as  that  proposed  here,  neither  of 
the  above  conditions  should  arise  very  often,  and  the 
utility  of  tracing  facilities  is  correspondingly  reduced. 
In  such  languages  tracing  facilities  can  often  be  eliminated 
entirely,  or  should  be  kept  very  simple  both  in  concept  and 
in  use. 

The  design  of  error-handling  facilities  is 
difficult  in  any  language,  and  constitutes  an  important 
research  area.  We  feel  that  the  resolution  of  the  ambiguity 
in SNOBOL4 of statement failure is an important step towards
the production of reliable, transparent programs. It is a
consequence of this ambiguity that many erroneous SNOBOL4
programs  have  the  property  that  some  errors  are  not  detected 
where  they  occur,  and  may  even  entirely  escape  detection 
during  execution.  A  large  class  of  errors  and  potential 
errors  arises  through  the  generation  and  use  of  undefined 
values.  Included  in  this  class  are  accesses  to 
uninitialized  variables,  and  the  use  of  results  from  some 
potentially  erroneous  operations  (e.g.,  division  by  zero). 
If  the  generation  of  "Undefined"  during  a  computation  is  at 
least  optionally  accompanied  by  the  printing  of  a  warning 
diagnostic  explaining  the  cause,  recovery  from  such 
potential  errors  can  be  programmed  explicitly  by  testing  the 
value  generated  and  acting  accordingly.  We  prevent  the 
propagation  of  any  such  errors  that  have  not  been  caught  in 
this  way  by  prohibiting  any  other  use  of  undefined  values  in 
the evaluation of expressions.

5. Implementation Considerations


One  of  the  overall  design  goals  of  the  language 
just  presented  is  that  it  be  amenable  to  efficient 
implementation. Accordingly, well-known and relatively
simple  techniques  exist  for  implementing  all  phases  of  the 
design.  Of  particular  interest  is  the  implementation  and 
optimization  of  string  patterns  and  pattern  matching. 

5.1 Sets

Underlying  the  implementation  of  pattern  matching 
is  the  representation  chosen  for  counted  sets.  As  pointed 
out  in  the  discussion  of  counted  sets  as  an  aggregate  type 
of  the  language,  the  representation  chosen  may  be  determined 
at  least  partly  by  language  considerations.  In  particular, 
if  the  cardinality  of  counted  sets  is  expected  to  be  large 
in  many  applications,  and/or  widely  variable,  a  descriptor- 
based  dynamic  storage  allocation  and  regeneration  mechanism 
is  appropriate.  Elements  might  be  represented  as  tuples 
containing  multiplicity  and  value  information,  and  storage 
for  the  elements  of  a  counted  set  can  be  allocated  and 
reclaimed  in  any  of  the  ways  appropriate  to  strings 
[Kain 1969; Madnick 1967]. These include:

1. Sequential allocation. Elements are stored
sequentially starting at a location pointed to by the set
descriptor. A "garbage collector" must be provided to
periodically regenerate storage. This is the mechanism most
frequently used to store strings, but is less appropriate to
the storage of sets since additions and deletions are
expensive, and one rarely wishes to access subsets that
happen to be stored contiguously.

2.  Linked  allocation.  At  the  expense  of  the 
additional  storage  requirement  for  links,  and  decreased 
efficiency  in  evaluating  the  membership  predicate,  additions 
and  deletions  are  much  more  efficient  when  elements  are 
stored  non-contiguously  and  linked  together  to  form  lists. 
This is particularly important in view of the desirability of
ordering set elements. Ordering greatly enhances the
efficiency of evaluation of the membership predicate, and
hence of the set union and element insertion and deletion
operations. In addition, determining the maximum and
minimum elements of sets of ordered types represented in
this way is a trivial operation.

3. Hybrids of the first two schemes. These involve
storing  sets  contiguously  where  possible,  and  providing 
links  from  the  end  of  one  block  of  storage  used  for  a  set  to 
the  beginning  of  the  next. 

Other  representations  can  be  used  in  special  cases. 
For  example,  if  it  can  be  shown  that  a  counted  set  will 
contain  only  integers  between  i  and  i+n,  and  a  bound  on 
element  multiplicities  is  known,  an  efficient  and  compact 
representation is the allocation of n+1 contiguous bit
sequences, each sufficient to represent the multiplicity of
one  element.  The  j-th  such  sequence  would  contain  the 
multiplicity  of  i+j-1  in  the  counted  set. 
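
The special-case representation just described can be sketched in Python with a bytearray. This is only an illustration under added assumptions: the class name BoundedCountedSet is hypothetical, one byte is used per element so multiplicities are assumed to lie between 0 and 255, and slots are indexed from zero, so that slot j holds the multiplicity of i+j rather than i+j-1.

```python
class BoundedCountedSet:
    """Counted set over the integers low..low+n with small non-negative multiplicities."""

    def __init__(self, low, n):
        self.low = low
        self.slots = bytearray(n + 1)    # n+1 contiguous slots, all initially zero

    def _index(self, value):
        index = value - self.low
        if not 0 <= index < len(self.slots):
            raise ValueError("element outside the declared range")
        return index

    def add(self, value, times=1):
        # bytearray raises ValueError if the multiplicity would leave 0..255.
        self.slots[self._index(value)] += times

    def multiplicity(self, value):
        index = value - self.low
        return self.slots[index] if 0 <= index < len(self.slots) else 0

    def elements(self):
        # Elements come out in increasing order with no extra work.
        return [self.low + j for j, m in enumerate(self.slots) if m]
```

Membership tests and multiplicity look-ups become constant-time index operations, at the cost of reserving storage for every element in the range whether or not it is present.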

5.2 Patterns

The pattern implementation techniques to which the
model is amenable are not dissimilar to those used in the
implementation of SNOBOL4 [Griswold 1972]. In particular,
graph structures of pattern descriptors provide a convenient
way to represent the alternation and concatenation structure
of a pattern. Using such descriptor techniques, the
primitive function, "Pattern", when applied to a string
argument, would construct a pattern descriptor identifying
itself as a primitive string pattern, and containing a
pointer to the string with which it is identified. When
applied to a function name, Pattern would construct a
descriptor identifying itself as a primitive functional
pattern, and containing a pointer to the function
implementing the pattern. Two additional descriptor fields
are required in this second case. One binds third and
subsequent function parameters, thus implementing
parameterized pattern primitives, such as the SNOBOL4
primitive pattern-valued functions LEN, SPAN, BREAK, etc.
The second additional descriptor field contains a pointer to
the function defining the pattern that is the reverse of the
pattern being defined. Both these descriptor fields are
left undefined by "Pattern".

The "reverse" field is set when a pattern is
defined  by  a  definition  of  the  form 

Reverse (P)  =  Q 

where  P  is  an  identifier  naming  a  primitive  functional 
pattern,  and  Q  is  a  primitive  functional  pattern.  If  a 
descriptor  for  P  exists,  this  definition  simply  sets  its 
reverse  field  to  point  to  the  function  defining  Q,  and  vice 
versa,  since  "reverse"  is  a  symmetric  relation.  Otherwise, 
a  descriptor  is  created  for  P  and  its  reverse  field 
initialized in the same fashion. The remaining fields in
the  descriptor  for  P  are  left  undefined  until  a  definition 
for  P  is  encountered.  We  know  of  no  means  of  ensuring  that 
P  and  Q  are  in  fact  mutual  reverses.  Thus,  an  error  in 
defining  reverses  can  cause  unexpected  results  during 
pattern  matching,  particularly  when  an  inverse  pattern  is 
referenced. 

The  reverses  of  primitive  string  patterns  can  be 
computed  directly,  and  reverses  of  compound  patterns  can  be 
computed  recursively  using  the  reversal  axioms  in  the  model. 

Pattern  inversion  causes  the  creation  of  a 
descriptor  containing  a  pointer  to  the  reverse  of  the 
pattern of which it represents the inverse. (Recall that
P⁻¹ is defined solely in terms of reverse (P).)

The alternation of two patterns causes a descriptor
containing pointers to the alternates to be created. A
similar descriptor is created to represent the concatenation
of two patterns.

Finally,  patterns  of  the  form  k*P  or  -P,  where  P  is 
a  pattern  and  k  an  integer-valued  expression,  are 
represented  by  descriptors  containing  the  value  of  k,  and  a 
pointer  to  P. 

In  building  patterns  this  way,  recursive  pattern 
definitions  have  a  natural  representation  as  linked  pattern 
structures  containing  cycles. 

Note  that  there  need  be  no  distinction  in  the 
implementation  between  patterns  created  by  pattern 
definitions  and  those  created  dynamically  during  execution. 
When  a  pattern  is  used  in  the  construction  of  another 
pattern,  either  in  a  pattern  definition  or  in  an  executable 
expression,  a  copy  of  its  descriptor  is  made.  If  the 
pattern  being  referenced  is  a  primitive  functional  pattern, 
its  arguments  must  be  bound  at  this  time.  The  descriptor 
field  reserved  for  this  purpose  is  set  to  point  to  a  block 
of storage containing the values and/or references
implementing the actual parameters of the pattern.

Example: The statement

P  :=  Len(M) 

assigns  to  P  the  following  descriptor: 


[Diagram: the descriptor assigned to P, a record with four
fields t, r, p and f. Fields r and f point to functions, and
field p points to a storage cell that in turn refers to M.]

where:  t  identifies  the  type  of  pattern  descriptor. 

r  points  to  the  function  defining 
reverse (Len) . 

p  points  to  a  location  containing  a 
reference  to  M. 

f points to the function defining Len. (In
this case f and r point to the same
function.)
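
The descriptor in this example can be mimicked in Python as a small record. This is a sketch under assumptions of our own: the names FunctionalDescriptor, Ref and match are hypothetical, Ref is a mutable cell standing in for a by-reference parameter, and collections.Counter stands in for a counted set.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


class Ref:
    """Mutable cell standing in for a by-reference actual parameter."""
    def __init__(self, value):
        self.value = value


@dataclass
class FunctionalDescriptor:
    kind: str                      # t: the type of pattern descriptor
    reverse: Optional[Callable]    # r: the function defining the reverse pattern
    params: Optional[list]         # p: block of bound parameter references
    func: Callable                 # f: the function implementing the pattern


def f(s, c, n):
    """Counterpart of the procedure F underlying Len."""
    if c < 0 or c > len(s) or n < 0:
        raise ValueError("invalid context or length")
    return Counter({c + n: 1}) if c + n <= len(s) else Counter()


def match(d, s, c):
    # Bound parameters are dereferenced only now, at match time.
    args = [ref.value for ref in (d.params or [])]
    return d.func(s, c, *args)


m = Ref(2)
# P := Len(M): f and r point to the same function, and p refers to M.
p = FunctionalDescriptor(kind="functional", reverse=f, params=[m], func=f)
```

Because the descriptor holds a reference to M and not its value, changing m.value after the assignment changes the length matched by p, which mirrors the deferred evaluation obtained in SNOBOL4 with LEN(*M).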

5.3 Pattern Matching


The  process  of  pattern  matching  can  be  very  simple, 
or  very  involved,  depending  upon  the  degree  of  optimization 
desired.  The  tree  representation  of  patterns  just  described 
is  particularly  amenable  to  a  simple  recursive  pattern 
matching  algorithm.  As  an  example,  recall  that  to  match  the 
pattern  (P  |  Q)  in  the  context  (S,c),  one  need  only  match  P 
and  Q  separately,  and  form  the  additive  union  of  the  two 
results. Even the iterative definition mechanism for
recursive patterns could be implemented directly, at
considerable cost.
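
A simple recursive matcher over a tree of descriptors might be sketched in Python as shown below. The descriptor classes Str, Alt, Cat and Scale are hypothetical names, and collections.Counter stands in for a counted set. We assume that concatenation matches the right operand at every post-cursor position produced by the left operand and multiplies multiplicities; this is our reading of the model and should be checked against its definitions. Cyclic (recursive) descriptor structures are not handled by this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Str:          # primitive string pattern
    text: str


@dataclass
class Alt:          # alternation P | Q
    left: object
    right: object


@dataclass
class Cat:          # concatenation P & Q
    left: object
    right: object


@dataclass
class Scale:        # k*P, with -P represented as k = -1
    k: int
    pattern: object


def _clean(counted):
    # Drop elements whose multiplicities have cancelled to zero.
    return Counter({e: m for e, m in counted.items() if m != 0})


def match(p, s, c):
    """Match descriptor p in the context (s, c); return a counted set of post-cursors."""
    if isinstance(p, Str):
        return Counter({c + len(p.text): 1}) if s.startswith(p.text, c) else Counter()
    if isinstance(p, Alt):
        result = Counter(match(p.left, s, c))
        result.update(match(p.right, s, c))      # additive union of the two results
        return _clean(result)
    if isinstance(p, Cat):
        result = Counter()
        for mid, m in match(p.left, s, c).items():
            for post, n in match(p.right, s, mid).items():
                result[post] += m * n
        return _clean(result)
    if isinstance(p, Scale):
        return _clean(Counter({e: p.k * m for e, m in match(p.pattern, s, c).items()}))
    raise TypeError("unknown descriptor")
```

In this sketch an alternate and its negation cancel completely, which illustrates why the presence of negative alternates prevents a matcher from stopping after the first successful alternate.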

The pattern matching heuristics and backtracking
mechanisms that combine to define SNOBOL4 pattern matching
can be applied to the optimization of pattern matching under
the model we have defined. The principal opportunities for
optimization arise from the fact that it may rarely be
necessary to generate all the post-cursor positions in the
result of a pattern match. That is, we suspect that
expressions such as "Any (P (S,c))" will occur predominantly
in programs using the model. The potential for optimization
of expressions of this type arises in all languages having
set data types, and research in these more general contexts
can be effectively used in the special cases encountered
here. Some current techniques are discussed in
[Schwartz 1974].


In  comparing  the  efficiency  of  pattern  matching  in 
the model presented here with that of the SNOBOL4 model,
certain definitional differences must be taken into account.
For example, an optimization problem arises in patterns
containing negative alternates. In those branches of the
pattern structure containing negative alternates the
possibility of cancellations precludes optimizations
designed to avoid matching alternates remaining after a
successful match.

Commutativity of the alternation operator in the
present model precludes the optimization, available to
SNOBOL4 programmers, of ordering alternates by likelihood of
successfully matching. In section 4.4.4, however, it is
argued  that  this  optimization  should  not  be  a  part  of  the 
pattern  specification  process,  but  rather  should  be  deferred 
and  programmed  explicitly,  when  it  is  appropriate,  as  part 
of  the  pattern  matching  process. 

On  parallel  and  vector  computers,  commutativity  of 
alternation  becomes  an  advantage,  and  the  need  for  the 
optimizations  discussed  above  vanishes  since  alternates  can 
be  matched  in  parallel  against  a  subject  string.  Since  the 
order  in  which  alternates  are  matched  is  important  in 
SNOBOL4, there is very little opportunity to take advantage
of hardware parallelism, and on such machines the model we
have proposed may very well prove to be the more efficient.

6. Directions for Future Research


We  have  argued  throughout  this  thesis  that  an 
algebraic model of string patterns offers substantial
advantages over existing interpretive models to both
language designers and programmers. The test of these
hypotheses must await the design of a programming language
incorporating the model, and a compiler for that language.
Although a considerable portion of this thesis is devoted to
a discussion of language design and implementation issues, a
major design and implementation project would necessarily
touch upon a great many specific and detailed questions not
dealt with or even foreseen here. A number of unresolved
problem  areas  deserve  mention: 

1. The mechanism for defining abstract data types
presented in section 4 is very primitive, and serves only to
indicate an approach to the problem. A realistic solution
would  involve  a  detailed  survey  of  current  research  in  the 
area,  and  an  analysis  of  the  data  abstraction  mechanisms 
most  appropriate  to  string  manipulation  problems. 

2. The dual nature of patterns as entities with
characteristics of both procedures and data forces a number
of awkward design decisions. Our treatment of parameterized
patterns is particularly inelegant.

3. The selection of a set of primitive patterns for
inclusion  in  a  string  manipulation  language  design  is  a 
difficult  process.  We  have  suggested  an  approach  to  this 
problem  that  we  feel  will  be  effective,  but,  due  to  its 
iterative  nature,  slow.  The  experience  gained  from  existing 
string  manipulation  languages  will  be  very  helpful  in  this 
regard. 


4.  We  have  said  very  little  about  the  difficult  area 
of  error-handling.  Of  particular  interest  would  be 
techniques  to  aid  the  programmer  in  detecting  errors  made  in 
defining  pattern  reverses. 

5.  Interpretive  pattern  matching  models  enjoy  the 
advantage  that  they  can  be  designed  with  efficient 
implementation  as  an  explicit  design  goal.  We  have 
discussed  techniques  for  the  optimization  of  implementations 
of the algebraic model, and pointed out that it may prove
better adapted to parallel machine architectures than
interpretive models designed for sequential machines, but
detailed research, preferably in the context of an
implementation project, remains to be done before specific
efficiency claims can be made.

6.  No  formal  techniques  exist  for  proving  or 
disproving  that  for  a  given  pattern,  P,  and  pattern 
procedure,  F, 

P = Pattern (F)

where we have used the notation of section 4. Such
techniques would be valuable in proving the correctness of
efficient implementations of recursive pattern definitions.